(57) Abstract

The present invention provides a method for generating a
representation of the extent of relatedness between at least two
classes of cells. The invention also provides a method for generating
a representation of the correlation between a first class of cells and a
second class of cells. The correlation reflects a change in the nature
and amount of nucleic acids present in the classes. In these methods,
the cells in each class are chosen from among cells of a given cell
type, cells from a given tissue, and cells from a given organ. The
methods establish similarities or differences between the classes by
defining a plurality of pairs of nucleotide subsequences, each pair
consisting of a first subsequence and a second subsequence, and,
in the nucleic acid of each class of cells, determining the presence
of a fragment with the first subsequence at one end and the second
subsequence at another end and having a length separated by the
first and second subsequences, as well as a quantitation of the extent
to which each fragment is present. The methods then determine the
extent of relatedness reflecting the similarities or differences among
the classes. The invention further provides display means displaying
a representation of the extent of relatedness between the classes of
cells, and displaying a representation of the correlation between
the first class of cells and the second class of cells. Additionally,
the invention provides a representation of the extent of relatedness
between the classes of cells, and a representation of the correlation
between the first class of cells and the second class of cells.
WO 00/15851, PCT/US99/21525

GEOMETRICAL AND HIERARCHICAL CLASSIFICATION BASED ON GENE EXPRESSION



FIELD OF THE INVENTION

This invention relates to representations of the extent of relatedness between cells, cell lines,
tissues, organs, or expressed sequences based on a genomic analysis of gene expression using software
algorithm based analysis.

RELATED APPLICATIONS 

This application claims priority to both United States Application Serial Number 



filed September 16, 1999, entitled "GEOMETRICAL AND HIERARCHICAL CLASSIFICATION BASED ON
GENE EXPRESSION", and United States Provisional Application Serial Number 60/101,009 filed
September 17, 1998, entitled "PHYLOGENOMICS AND PHARMACOGENOMICS", which are incorporated
herein by reference in their entirety.

BACKGROUND OF THE INVENTION 

The rapid development of genomics and proteomics in recent years has led to a burgeoning of
applications making use of the new information provided. A significant area in which such information
has been put to use is in the grouping and characterization of pathological states according to the
differential expression of genes in such states. A corollary application is in grouping and characterizing
the therapeutic effects of known or candidate pharmaceutical agents used in treating various pathologies.
Algorithms employing a variety of statistical procedures have been employed to create heuristic displays
of the information obtained from such analyses. These displays include large two dimensional, or even
higher dimensional arrays in which the elements are coded, for example by false color coding, to
represent a particular experimental result. Alternative displays include those in which the experimental
data are used to generate cladistic or radiating tree structures as a representation of relatedness.
Furthermore, it is also possible to use similar methods to group expressed sequences according to
patterns of co-expression over several different biological states.

For example, a system of cluster analysis for genome-wide expression in the yeast
Saccharomyces cerevisiae and in primary human fibroblasts has been presented by Eisen et al. (Proc.
Natl. Acad. Sci. USA 95:14863-14868 (1998)). In the yeast work, DNA microchip arrays carrying
essentially every ORF from this organism were used. Differential expression was studied by varying the
physiological state, including the diauxic shift, the mitotic cell division cycle, sporulation, and
temperature and reducing shocks. The human fibroblasts were stimulated with serum following serum
starvation, and examined using a microarray with 9,800 cDNAs representing approximately 8,600
distinct human transcripts. Additionally, a further independent variable in these experiments is the time
at which an assay point was taken. Data reflecting the differential gene expression in the various studies
were analyzed using pairwise average-linkage cluster analysis (Sokal et al., Univ. Kans. Sci. Bull.
38:1409-1438 (1958)), which was used to compute a dendrogram that assembles all elements into a
single tree.

Colon adenocarcinoma tissue from 40 tumor samples was compared with 22 normal colon tissue
samples using Affymetrix DNA chips to which sequences from human cDNAs were bound (Alon et al.,
Proc. Natl. Acad. Sci. USA 96:6745-6750 (June 1999)). 3,200 full-length human cDNAs and 3,400 ESTs
are represented in sets of 25-bp fragments, as well as such sequences containing a single base mismatch
in the center of the sequence. The gene expression in both the tumor tissue samples and the normal
colon samples was assessed by hybridization. The statistical significance of the correlation between
genes was assessed by calculating pairwise correlation coefficients. The clustering of the expressed
genes was evaluated using an algorithm based on deterministic annealing (Rose et al., Phys. Rev. Lett.
65:945-948 (1990); Rose, Proc. IEEE 96:2210-2239 (1998)) to organize the data in a binary tree. Data
are presented as a large two-dimensional color coded array, with genes displayed along one dimension
and tissue samples along the other; artificial color values are assigned at each array point to indicate the
extent of expression in a third dimension. Clustering analysis reveals patterns in the color distribution
within the array which are disrupted when various randomization procedures are applied. The clustering
of the genes in the data set reveals groups of genes whose expression is correlated across tissue types.
The algorithm separated the tissues into distinct clusters.

Pharmacological effects of compounds actually used or being screened for use in cancer
chemotherapy were analyzed by cluster analysis at the National Cancer Institute (Weinstein et al.,
Science 275:343-349 (1997)). More than 60,000 compounds were screened against a panel of 60 human
cancer cell lines. A 50% growth-inhibitory concentration of a compound in a given cell line, when
analyzed across all cell lines, provided detailed information on mechanisms of drug action and drug
resistance. Patterns of activity were first analyzed by the COMPARE algorithm (Paull et al., J. Natl.
Cancer Inst. 81:1088 (1989); Jayaram, Biochem. Biophys. Res. Commun. 186:1600 (1992); Paull et al.,
In: CANCER CHEMOTHERAPEUTIC AGENTS, Foye (ed.), American Chemical Society, Washington DC,
1993, pp. 1574-1581; Boyd et al., Drug Dev. Res. 34:91 (1995)). The procedures developed rely on
three databases, an S database characterizing structural information on the candidate compounds, an A
database related to the 60 cell lines and a T database including information on molecular targets of
action. In an example of the results of the analysis, a three dimensional array displaying compounds
versus targets, with a false color code providing a correlation coefficient in a third dimension for each
position in the array, was developed.

Certain problems arise upon consideration of the procedures currently in use for the correlation 
and clustering of genome-derived attributes. Use of DNA microchips inherently limits any analysis to 
the sampling of the DNA sequence fragments employed as the capture probes bound to the chips. 
Detection of any DNA fragment which does not hybridize with one of the capture probes is not possible, 
so that positive results are potentially lost. Additionally, a mutation or other allelic polymorphism may 
not bind to the capture probe under conditions of moderate or low stringency, so that again information 
relating to a positive result may be lost. 

For these reasons there is a need for methods of genomic statistical analysis based on more
comprehensive accessibility to the genomes of the organisms being studied. Furthermore there remains a
need for ways of presenting the information obtained in genomic analyses of relatedness of genes, and in
genomic analysis of response to actual or candidate pharmaceutical agents, that includes information
gleaned from a comprehensive access to the genomes in question. The present invention addresses these
needs, for use is made in the invention of partial and full genomic sequences available from a large
number of sequence databases in clustering analysis of the components appearing as independent
variables in a particular study.

SUMMARY OF THE INVENTION

The invention provides novel methods of geometric and hierarchical classification between at
least two classes of data sets. Data sets may represent cells, nucleic acid sequences, polypeptide
sequences, or the like. The invention is able to utilize both standard DNA microchip arrays and
non-DNA chip technology to provide input information on nucleic acid moieties of the specified classes
of cells. The data are then treated in various ways to provide representations of relatedness that are
readily interpretable by the human eye. The invention additionally provides novel methods for
generating a representation of the correlation between at least two classes of cells, the correlation
reflecting any changes in the composition and amount of nucleic acids present between the classes.

The cell classes may be from different sources for use in comparing differences between various
cell populations. These differences include, but are not limited to, species differences, tissue differences,
disease state differences, and drug treatment differences. Computer algorithms analyze input data
reflecting differences between chosen cell classes and represent them in a meaningful way.

Prior to the present invention, input information was obtained only using DNA-chip technology
to analyze the nucleic acids of the cell classes to be compared. Drawbacks to these methods are that
identifier sequences need to be already known and isolated, chip technology has size limitations related
to the number of the nucleic acids immobilized on the chips, and, once the chips are manufactured, it is
virtually impossible to expand nucleic acid parameters. The invention provides the use of
GeneCalling™, a non-DNA chip technology, to assay differences between input cell classes. An
unexpected result is that GeneCalling™ is able to provide sensitive comparisons between the disparate
groups above, thereby sidestepping the limitations inherent in the use of DNA chip technology when
assaying input nucleic acid populations.

The invention provides a novel method for generating the extent of relatedness reflecting
similarities or differences in the presence and quantitation of the fragments among the classes by
calculating a distance that reflects the amplitude of a difference vector. In a significant embodiment of
this method for generating the representation of relatedness, the extent of relatedness is provided by
generating a tree structure reflecting the relatedness between any two classes. The branches of the tree 
structure reflect the difference vectors and are ramified from nodes. 

The invention also provides a novel method for generating a representation of the correlation
between classes of data sets. In a significant embodiment of the method for generating a representation
of the correlation, the correlation is related to a set of orthonormal eigenvectors. In another significant
embodiment of the method for generating a representation of the correlation, the representation is a
cluster diagram or a dendrogram, and includes a tree structure reflecting the relatedness of the pathways
involved in the biochemical or physiological response to a difference between cells of the two classes.

The invention additionally relates to providing geometrical representations of differences
between classes of data sets. The geometrical representations encompass, by way of nonlimiting
example, principal component analysis and principal factor analysis, as well as reduced dimensional
representations derived from them. The geometrical representations are based on differences determined
between classes of cells using any method of analyzing for the presence of genes, nucleic acids, or
fragments thereof, including nucleic acid microchip arrays and differential display of expressed genes or
nucleic acid fragments.

The invention also provides display means for displaying the representation of the extent of
relatedness, the correlation, and the geometrical representations of differences between classes of data
sets, as well as the representations themselves.
BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a schematic flow diagram illustrating the principal steps involved in generating the
various representations of the invention starting from a set of subsequence-selected fragments found for
the samples.

Figure 2 is a schematic flow diagram illustrating the primary steps involved in carrying out a
principal component analysis.

Figure 3 illustrates hierarchical clustering of four drugs with sterile water as an outgroup. 

Figure 4 is a graphical projection of drug treatments and controls onto principal factors. 

DETAILED DESCRIPTION 

The present invention relates to methods for preparing representations of the relatedness between
cells of any two or more different classes of cells. The classes broadly encompass cells arising in animal
and plant organisms, the cells further being normal cells or cells in a diseased state, including tumor
cells. They further include cells that have been treated with a putative pharmaceutical agent. The
representations are obtained using experimental data that provide size and sequence information on
nucleic acid fragments derived from each of the cellular sources. The fragments may be prepared from
the nucleic acid content of the cells in each class in any of several ways. For example, in a particularly
important embodiment, they may be subjected to digestion by particular pairs of restriction
endonucleases; alternatively, in another important embodiment, cell extracts may be subjected to
amplification using specially designed primer oligonucleotides. The present invention also relates to
methods for preparing representations of the relatedness in terms of co-expression between the nucleic
acid fragments so produced.

The invention further relates to the representations provided by these methods, and to display
means on which such representations are displayed. The methods for preparing the fragments, such as
the use of restriction endonucleases or the application of amplification primers, are chosen to provide
subsequence information relating to the ends of the resulting fragments, while size determination
provides the length of the fragment. In certain applications of these types of information, the size and
subsequence results can optionally be scanned against databases providing known nucleic acid sequences
in order to provide the identity of one or more candidate fragments of known complete nucleic acid
sequences having the correct length and terminal subsequences (U.S. Patent No. 5,871,697; Shimkets et
al. 1999 Nature Biotechnology 17:798-803). This database look-up step is not a required feature of the
current invention. For this reason, the present representations and methods are more comprehensive and
more informative of genomic variations among the samples than those currently known. As described in
the Background of the Invention, currently known procedures are restricted in their comprehensiveness
to those nucleic acid fragments that are applied to DNA microchips as probe sequences in a given
procedure. Except for a narrowly limited set of model organisms with known genome sequence, the
number of such probe sequences is considerably fewer than the number of known nucleic acid sequences
available in sequence databases and employed in the present invention. Furthermore, even for fully
sequenced genomes, genetic variation is not adequately probed with existing DNA microchips. This
distinction characterizes an important advantage of the instant invention.

The invention additionally relates to providing geometrical representations of differences
between classes of cells. The geometrical representations encompass, by way of nonlimiting example,
principal component analysis and principal factor analysis, as well as reduced dimensional
representations derived from them. The geometrical representations are based on differences determined
between classes of cells using any method of analyzing for the presence of genes, nucleic acids, or
fragments thereof, including nucleic acid microchip arrays and differential display of expressed genes or
nucleic acid fragments.

As used herein, "sample" relates to a particular experimental state for which all the variables
being studied in a project are held fixed. By way of nonlimiting example, if a variable is a class of cell,
the "sample" refers to a particular cell type; if a variable is the subsequence pairs employed in the
project, a "sample" refers to a particular subsequence pair; or if a variable is a set of putative
pharmaceutical agents, a "sample" refers to a particular agent from the set. As used herein,
"representation" relates to any graphical, visual, or equivalent non-verbal display that provides an image
of the results obtained according to the methods of the present invention. More specifically, a
"representation" of the invention is obtained by transforming the quantitative results gathered by
experiments underlying the invention. Examples of such data include, by way of non-limiting example,
differential gene expression across classes of cell, and/or across a set of putative therapeutic agents,
and/or equivalent types of experimental parameter.

In important embodiments, a representation of the invention is generated by algorithms executed
in a computer and is suitable for display on a display means, such as a display screen or monitor,
employed in the operation of the computer. The representation is also suitable for storing in a storage
module or data archive of such a computer. It is still further suitable for printing from the computer onto
a medium such as paper or equivalent physical medium, and/or recording it onto a portable storage
medium, including, for example, magnetic media, CD-ROMs and equivalent storage media. As used
herein, "display means" includes any of the objects and media identified above in this paragraph, as well
as equivalent apparatuses and objects suitable for displaying the results of computational processes for
visual inspection.

As used herein, "extent of relatedness" is a characterization according to methods of the present 
invention of a degree of similarity or a degree of non-similarity between any two members of the same 
type of element; in particularly important embodiments, the type of element may be classes of cells.

As used herein, a "putative pharmaceutical agent" relates to a chemical compound or a 
composition comprising at least one chemical compound which is a candidate for being a therapeutic 
agent. Any such therapeutic agent may be used in treating a mammal suffering from a disease or a 
pathology. In treating the mammal with the therapeutic agent it is intended to attenuate the symptoms 
and/or the underlying causes of the disease or the pathology, to ameliorate the symptoms and/or the 
underlying causes, and/or to contribute to a cure of the disease or the pathology. Non-limiting examples 
of a putative pharmaceutical agent include an agent drawn from a chemical compound library; an isolate
from a natural source; a compound synthesized specifically as a putative agent; or a substance derived or
obtained using the practices of genetic engineering and recombinant nucleic acid technology such as a
recombinant protein, a fragment of a recombinant protein, a recombinant polypeptide, a fragment of a
recombinant polypeptide, a recombinant peptide, or a nucleic acid including, for example, an
oligonucleotide intended as an antisense agent, and a recombinant gene intended for administration as a 
gene therapeutic agent. 

As used herein, a "fragment" of a nucleic acid relates to a contiguous portion originating from
the genomic or cDNA-derived nucleic acid from a class of cells. The contiguous portion includes at or
near each end a target subsequence defined according to the operational procedures disclosed herein, and
includes all nucleotides in the sequence of the fragment bounded by the two target subsequences. The
nucleotides between the two target subsequences, together with the subsequences themselves, define a
"length" of the fragment, as used herein. The target subsequences are identified, for example, by
contacting the nucleic acid from the cells with a specific pair of restriction endonucleases, or with a
specific pair of oligonucleotide primers, and in equivalent ways.

The information used in the present invention is obtained from experiments providing the results
of differential gene expression wherein the difference relates to an experimental state and a reference
state. Commonly a reference state refers to a normal, or an unperturbed, or a non-pathological class of
cells. An experimental state may relate to a certain set of conditions applied to one class of cells, and the
corresponding reference state then relates to the same set of conditions applied to a second class of cells.
An experimental state may also relate to a class of cells in the presence of one or more putative
therapeutic agents, in which case the reference state relates to the same class of cells in the absence of
any putative therapeutic agent. An experimental state may furthermore be obtained from a class of cells
that is of interest in a particular set of circumstances. This includes cells of a given cell type, cells from
a given tissue, and cells from a given organ, and further includes cells that may be noncancerous or
cancerous. Types of cell encompassed within the present invention include, by way of non-limiting
example, endothelial cells, mesothelial cells, and epithelial cells. Tissues and organs included within the
present invention may be, by way of non-limiting example, lung, heart, skeletal muscle, smooth muscle,
brain, central nervous system, peripheral nervous system, stomach, liver, kidney, reproductive tissues
and organs, skin, and bone. Cancerous cells include, by way of non-limiting example, cells from prostate
cancer, breast cancer, colon cancer, lung cancer, lymphatic or hematopoietic cancers, and also include
cells obtained from tissue biopsies or from cell lines in the National Cancer Institute human tumor cell
line panel. The cells subjected to analysis in the present invention may also originate from plants, yeast,
fungi, and other taxonomic groupings.

The methods of evaluating the extent of relatedness between classes of cells, for example,
between a first class of cells and a second class of cells, are founded on evaluating the extent of
relatedness of the expression of particular genes between the cells of the two classes. In a preferred
embodiment of the invention, similarities and differences in the susceptibility of the nucleic acid present
in the cells to digestion by specific pairs of restriction endonucleases are determined, according to the
methods of the present invention, by procedures that are disclosed in detail in co-owned U.S. Patent No.
5,871,697 to Rothberg et al., and in Shimkets et al. 1999 (Nature Biotechnology 17:798-803), both of
which are incorporated herein by reference in their entirety.

Briefly, for any experimental state of a class of cells, the nucleic acid content of the cells,
preferably in the form of a preparation of cDNA from the cells, is subjected to restriction endonuclease
("RE") digestion by specific pairs of endonucleases. Each member of the RE pair is chosen to optimize
the likelihood that a restriction fragment resulting from the nuclease digestion will be a unique fragment.
In an important implementation of this method, the restriction nuclease digestion is carried out on cDNA
prepared from the cells of the class in the given experimental state. This implementation leads to
emphasis on genes that are expressed in the experimental state, many of which may be characteristic of
the given experimental state and be more poorly expressed, or not expressed at all significantly, in a
different experimental state. A large number of specific pairs of nucleases may be employed.
Alternatively, expression of a gene may be repressed in a characteristic way in a given experimental state
and be expressed at a higher level, such as at a constitutive level, in a different experimental state. By
way of non-limiting example, several pairs of restriction nucleases that may be employed in
implementing the present invention are disclosed in U.S. Patent No. 5,871,697.

In an alternative embodiment, the extent of relatedness may be obtained by amplification
fragment length polymorphism analysis ("AFLP"). Briefly, the nucleic acid content of
the class of cells being examined is subjected to a primer-dependent amplification procedure in which
any of a set of primer pairs is used to initiate amplification. Amplification procedures are described in
considerable detail in, for example, Innis et al., PCR PROTOCOLS: A GUIDE TO METHODS AND
APPLICATIONS, Academic Press, New York (1989), and Innis et al., PCR STRATEGIES, Academic Press,
New York (1995). The primers of each primer pair are different from each other, and reflect different
subsequences that are the object of the amplification process. Amplification may proceed by any
procedure, including polymerase chain reaction, known in the field of molecular biology. In AFLP, the
length of an amplicon found in a given experimental state differs from the length found in a different
experimental state. This may arise, for example, if the given experimental state arises from a mutation
that occurs in a subsequence recognized by a primer used in the amplification reaction. It may also arise
from a deletion from, or an insertion into, the nucleic acid of the cells in that state.

The experimental and computational procedures that may be employed to generate the 
representations of the present invention are described generally below. 

Measurements 

At the outset, the gene expression levels are determined experimentally. This can be done, in a
preferred embodiment, by following the general protocols of differential expression using restriction
endonucleases (U.S. Patent No. 5,871,697). For each pair of restriction enzymes and each biological
sample, a pool of fluorescently-labeled DNA fragments is generated. Electrophoresis is then performed
to separate these fragments based on size, and an intensity, designated as $I_{srt}(x)$, is detected, where
s labels the sample, i.e., the cell class; r labels the restriction enzyme pair, i.e., the gene fragment; t labels
the trial; and x is the length of the fragment as determined by electrophoresis. The length x may be
either a continuous index or a convenient discretization. As an example, the resolution of the
electropherogram may be set to a discretization of 0.1 nucleotide ("nt"). Commonly three independent
trials are performed. A mean signal $I_{sr}(x)$ is then obtained by averaging over the $n_t$ trials:

$$I_{sr}(x) = \frac{1}{n_t} \sum_t I_{srt}(x) \qquad (1)$$
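As a minimal sketch of Equation (1), the averaging over trials might be written as follows. The function name `mean_signal` and the list-of-lists layout (one intensity trace per trial, one entry per discretized length) are assumptions made for illustration and do not come from the specification.

```python
def mean_signal(trials):
    """Average intensity traces over trials, as in Equation (1).

    `trials[t][i]` is the intensity of trial t at the i-th discretized
    length x for one sample s and one restriction enzyme pair r.
    (Hypothetical data layout, chosen for this illustration.)
    """
    n_t = len(trials)
    n_x = len(trials[0])
    return [sum(trial[i] for trial in trials) / n_t for i in range(n_x)]


# Three independent trials, three length bins.
trials = [
    [1.0, 4.0, 2.0],
    [2.0, 5.0, 2.0],
    [3.0, 6.0, 2.0],
]
mean = mean_signal(trials)
print(mean)
```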
Next, lengths x for each restriction enzyme pair r where some of the samples have a significant
difference in measured intensity are identified. Such a difference is determined with respect to cell
types, or with respect to the presence vs. the absence of a putative pharmaceutical agent. Labeling the d-th
such difference d, the values $I_{sd} = I_{sr}(x)$ are then collected. Any of several methods for identifying
significant differences may be employed, some of which are outlined herein. For example, an important
method involves the following computational steps:

1. The mean $\bar{I}_r(x) = (1/n_s) \sum_s I_{sr}(x)$ over the $n_s$ samples is evaluated.

2. All positions, i.e. lengths, where, for at least one sample, $I_{sr}(x) - \bar{I}_r(x)$ is larger than some
threshold value, are marked.

3. The largest value of $I_{sr}(x) - \bar{I}_r(x)$, determined as a difference between a sample state and
the mean for restriction enzyme pair r, is found and the length x, indexing the difference,
is marked.

4. Step 3 is repeated for succeedingly smaller values of the intensity difference. If the
length x that marks the current largest difference is within a distance w from the length
of a previously identified difference, the current difference is skipped and the next
smaller difference is considered.

5. Step 4 is repeated until there are no more differences to consider.
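The five steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure as actually implemented: the function name `select_differences`, the use of bin indices in place of physical lengths, and the choice to measure the window w in bins are all hypothetical, and the signed difference (sample minus mean) is used as the text is read here.

```python
def select_differences(traces, threshold, w):
    """Greedy selection of lengths showing a significant difference.

    `traces[s][i]` is the mean intensity of sample s at length bin i
    for a single restriction enzyme pair (hypothetical layout).
    Returns the selected bin indices, largest difference first.
    """
    n_s = len(traces)
    n_x = len(traces[0])
    # Step 1: mean over samples at each length.
    mean = [sum(tr[i] for tr in traces) / n_s for i in range(n_x)]
    # Step 2: mark lengths where some sample exceeds the mean by more
    # than the threshold, remembering the largest such difference.
    candidates = []
    for i in range(n_x):
        dev = max(tr[i] - mean[i] for tr in traces)
        if dev > threshold:
            candidates.append((dev, i))
    # Steps 3-5: visit candidates from largest to smallest difference,
    # skipping any that lies within w bins of an accepted length.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    selected = []
    for dev, i in candidates:
        if all(abs(i - j) > w for j in selected):
            selected.append(i)
    return selected


traces = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 8.0, 6.0, 0.0, 0.0, 0.0, 4.0, 0.0],
]
picked = select_differences(traces, threshold=1.0, w=1)
print(picked)
```

In this toy input the peak at bin 2 is adjacent to the larger peak at bin 1, so it is skipped, while the isolated peak at bin 6 is kept.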

Another method involves finding differences that meet a statistical criterion. A particular
example of such a method involves the computational steps of

1. defining a set of sample classes and assigning each sample to a particular class c;

2. for each restriction enzyme pair r and length x, evaluating the F-statistic for the set of
measurements $I_{sr}(x)$ and the classes c to which samples are assigned, thereby providing
the probability $p_r(x)$ that any differences between sample classes may be explained by
random variation (see, for example, P. Hinton, Statistics Explained, Routledge 1995);

3. ordering the probabilities $p_r(x)$ from smallest (most significant) to largest (least
significant);

4. optionally truncating the list at some threshold value above which differences are
no longer considered significant (accepted values are $p_r(x)$ = 0.01 to 0.05);

5. finding the smallest value of $p_r(x)$ and marking the length x as a difference for restriction
enzyme pair r;

6. repeating step 5 and determining whether the length x that marks the current difference
is in a region that is within a distance w of a previous difference, in which case the
current difference is skipped and the next most significant difference is considered; and

7. continuing until there are no more differences to consider.

These exemplary computational procedures provide a set of measures of intensity $I_{sd}$ for the class
of cells in sample s at difference d.

Distances 

For hierarchical clustering, a distance $D_{ss'}$ may be defined as the distance in vector space
between pairs of samples s and s'. A variety of methods for calculating $D_{ss'}$ are available. Some
examples, which are intended as being nonlimiting, are provided below.

$D_{ss'}$ as a scaled correlation function:

1. One calculates $\mu_d = (1/n_s) \sum_s I_{sd}$ and $\sigma_d = [(1/n_s) \sum_s (I_{sd} - \mu_d)^2]^{1/2}$. If data is missing, for
example no measurement of $I_{sd}$ exists for some sample s, that sample is excluded from the
sum and $n_s$ is reduced by 1.

2. One calculates $J_{sd} = (I_{sd} - \mu_d) / \sigma_d$. If data is missing for $I_{sd}$, then $J_{sd}$ is defined as $J_{sd} = 0$.

3. One calculates $\mu_s = (1/n_d) \sum_d J_{sd}$ and $\sigma_s = [(1/n_d) \sum_d (J_{sd} - \mu_s)^2]^{1/2}$.

4. One calculates $K_{sd} = (J_{sd} - \mu_s) / \sigma_s$.

5. One calculates the covariance matrix $S_{ss'} = (1/n_d) \sum_d K_{sd} K_{s'd}$.

6. One calculates the correlation matrix $C_{ss'} = S_{ss'} / [S_{ss} S_{s's'}]^{1/2}$.

7. One calculates $D_{ss'} = [2 - 2 C_{ss'}]^{1/2}$.

D,, as a Euclidean distance; D,^ = [ Y.,, (K^, - f . 



D,, as a Pearson distance; D., - [ Ij (l,.- L ur ]"Svhercc, is defmcd in step 1 of scaled 
correlation function above. 
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D,, as a pairwise Pearson distance: 

1 . One calculates the covariance matrix S,,- - { |/n,)[ [J^.^ _ (v ) (v^ ) / y 

2. One calculates the correlation matrix S^, / [ S S 

3. One calculates D,, = [ 2 - 2 C,,- 

D^, as a Mahalanobis distance: 

1 . One calculates the covariance matrix S,,- - (I, I^^ ) ~ i^^) (v j^^.^/j^^ 

2. One calculates the correlation matrix C^, - S,, / [ S,, , Y ' and its matrix inverse C ',, 

3. One calculates D,,. - [ I.,, ( - I.,) C'^.^. (1^,. - i^,,.) f ^ 

It is contemplated that other distance methods known in the art may be used in the invention,
such as Spearman correlation, and the like. Other methods known in the art can be found, for example
and not by way of limitation, in K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis,
Academic Press, New York, 1979.
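
As an illustration of the scaled correlation function, the seven steps can be sketched with NumPy. The sketch rests on assumptions that are not part of the specification: the measurements are held in a samples-by-differences array, missing measurements are encoded as `np.nan`, no difference or sample is constant (so that no standard deviation is zero), and the function name `scaled_correlation_distance` is hypothetical.

```python
import numpy as np

def scaled_correlation_distance(I):
    """Scaled-correlation distance between samples (rows of I); a sketch."""
    I = np.asarray(I, dtype=float)
    # Steps 1-2: standardize each difference (column) over the samples
    # that actually have a measurement; missing entries become J = 0.
    mu_d = np.nanmean(I, axis=0)
    sigma_d = np.nanstd(I, axis=0)            # population form, 1/n
    J = (I - mu_d) / sigma_d
    J[np.isnan(J)] = 0.0
    # Steps 3-4: standardize each sample (row) over the differences.
    K = (J - J.mean(axis=1, keepdims=True)) / J.std(axis=1, keepdims=True)
    # Step 5: covariance between samples.
    S = K @ K.T / K.shape[1]
    # Step 6: correlation matrix.
    C = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))
    # Step 7: distance; clip guards against tiny negative rounding errors.
    return np.sqrt(np.clip(2.0 - 2.0 * C, 0.0, None))
```

Because the correlation lies between -1 and 1, the resulting distance lies between 0 (perfectly correlated profiles) and 2 (perfectly anticorrelated profiles).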

Hierarchical Clustering 

The distances can be used to perform hierarchical clustering of the samples. A general algorithm
for clustering is described below.

1. Each sample s is assigned its own initial cluster c.

2. One calculates all the distances between pairs of clusters and finds the smallest distance.
These two clusters are joined into a single cluster and the number of clusters is decreased
by 1.

3. Step 2 is repeated until only a single cluster remains.

In order to implement this algorithm, a method to calculate the distance between pairs of clusters
is also required. Some nonlimiting examples of such calculations, using well-known methods, are
indicated below.

Nearest neighbor, single linkage: The distance between clusters c and c' is the smallest distance
$D_{ss'}$, where s ranges over all samples in cluster c and s' ranges over all samples in cluster c'.

Unweighted pair group method using arithmetic averages (UPGMA), also known as average
linkage: The distance between clusters c and c' is $(\sum_{ss'} D_{ss'}) / (n_c n_{c'})$, where s ranges over all
samples in cluster c, s' ranges over all samples in cluster c', $n_c$ is the number of samples in
cluster c, and $n_{c'}$ is the number of samples in cluster c'.

Furthest neighbor, complete linkage: The distance between clusters c and c' is the largest
distance $D_{ss'}$, where s ranges over all samples in cluster c and s' ranges over all samples in cluster c'.

Other distance-based hierarchical clustering methods are well-known. See, for example, Wen-Hsiung
Li, Molecular Evolution, Sinauer Assoc., 1997.
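
The general algorithm and the three linkage rules can be sketched in plain Python. This is a deliberately simple illustration rather than an efficient implementation; the function names, the representation of clusters as tuples of sample indices, and the returned list of merges are assumptions made for the example.

```python
def cluster_distance(D, a, b, linkage):
    """Distance between clusters a and b (tuples of sample indices)."""
    pairs = [D[s][t] for s in a for t in b]
    if linkage == "single":        # nearest neighbor
        return min(pairs)
    if linkage == "complete":      # furthest neighbor
        return max(pairs)
    if linkage == "average":       # UPGMA
        return sum(pairs) / (len(a) * len(b))
    raise ValueError("unknown linkage: " + linkage)

def hierarchical_clustering(D, linkage="average"):
    """Return the merges as (cluster, cluster, distance), in join order."""
    clusters = [(s,) for s in range(len(D))]   # step 1: one cluster per sample
    merges = []
    while len(clusters) > 1:                   # step 3: repeat until one remains
        best = None
        for i in range(len(clusters)):         # step 2: find the closest pair
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(D, clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        joined = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(joined)
    return merges
```

The list of merges carries the information needed to draw a dendrogram: each entry names the two clusters joined and the height at which they join.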

Software packages are available to perform the clustering and display the results. See, for
example, Phylip, Joe Felsenstein, http://evolution.genetics.washington.edu for clustering, and Treeview,
Rod Page, http://taxonomy.zoology.gla.ac.uk/rod/treeview.html for display. The source code for the unit
within Phylip employed for the clustering, and the downloaded executable file of Treeview for Windows
95 and Windows NT, as well as a manual for Treeview, are available from the owner of the present
application.

Two-Dimensional Clustering

It is also possible to cluster the differences, rather than clustering the samples. One simply
exchanges the roles of the samples and differences in the equations above. Furthermore, it is possible to
perform clustering of both samples and differences, and then to display the measurements $I_{sd}$ in which
both samples and differences are presented in cluster order.

Principal Component Analysis and Principal Factor Analysis 

Principal component analysis is described in standard texts. See, for example, Mardia, Kent, and
Bibby. To perform principal component analysis, one begins with a correlation matrix $C_{ss'}$ as defined
above, in the section "Distances". (Alternatively, one could use the covariance matrix $S_{ss'}$.) Eigenvalues
and eigenvectors, defined such that $\sum_{s'} C_{ss'} g_{s'i} = a_i g_{si}$, where the $i$-th eigenvalue is $a_i$ and its
eigenvector is $g_{si}$, are calculated. The eigenvalues are ordered from largest to smallest:
$a_1 > a_2 > \ldots > a_n$. To obtain a reduced dimensional depiction of the samples, a number of desired
dimensions k is chosen. Then, in k-dimensional space, sample s is represented as the point
$(g_{s1}, g_{s2}, \ldots, g_{sk})$. Samples that are close in the k-dimensional space have similar expression
profiles and may be considered to be related.
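
A minimal sketch of this reduction, assuming the correlation matrix is available as a symmetric NumPy array; the function name `principal_coordinates` is hypothetical. `numpy.linalg.eigh` returns eigenvalues in ascending order, so they are re-sorted to match the ordering used in the text.

```python
import numpy as np

def principal_coordinates(C, k):
    """Return the k largest eigenvalues a_i and the points (g_s1, ..., g_sk)."""
    C = np.asarray(C, dtype=float)
    a, g = np.linalg.eigh(C)          # columns of g are unit eigenvectors
    order = np.argsort(a)[::-1]       # largest eigenvalue first
    a, g = a[order], g[:, order]
    return a[:k], g[:, :k]            # row s of g[:, :k] locates sample s
```

Each row of the returned array is the position of one sample in the reduced k-dimensional space.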

As an alternative to using the correlation matrix as the starting point for principal component
analysis, it is possible to calculate principal components using the inner product matrix from
multidimensional scaling defined as

B = HCH (2)

where C is the correlation matrix, H is the centering matrix with diagonal elements given by 1 - (1/n) and
off-diagonal elements -(1/n), where n is the number of items being correlated. (See, for example,
Mardia, Kent, and Bibby, Multivariate Analysis, and Arkin, Shen, and Ross, Science 277: 1275 (1997)).
The $k$-th principal component is then the $k$-th eigenvector of B normalized to unit length and ordered by
decreasing eigenvalue, and the $k$-th principal factor is obtained by scaling the eigenvector by $\lambda_k^{1/2}$,
where $\lambda_k$ is the corresponding eigenvalue. The projection of sample s onto the $k$-th principal factor
is the element of the factor for row s. The
components or factors are ordered from 1 (corresponding to the most informative) to n (corresponding to
the least informative). By using some, but not all, of the components or factors, the samples can be
represented in a small-dimensional geometric space. Furthermore, the amount of information retained in
the representation can be related to the eigenvalues of the components that are used (See Mardia, Kent,
and Bibby).

A centered inner product matrix B appropriate for principal component or principal factor
analysis can also be obtained from any distance matrix $D_{ss'}$ as

B = HAH (3)

where

$A_{ss'} = -(1/2) (D_{ss'})^2$. (4)
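
Equations (3) and (4) translate directly into a few lines of NumPy. The sketch below assumes a square, symmetric distance matrix; the function name `centered_inner_product` is hypothetical.

```python
import numpy as np

def centered_inner_product(D):
    """Compute B = H A H with A = -(1/2) D^2 (element-wise square)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Centering matrix: 1 - 1/n on the diagonal, -1/n elsewhere.
    H = np.eye(n) - np.ones((n, n)) / n
    A = -0.5 * D ** 2
    return H @ A @ H
```

When the distances are Euclidean distances between points, B equals the matrix of inner products of the centered point coordinates, which is why its eigenvectors can be used in place of those of the correlation matrix.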

To perform principal factor analysis, factor i is defined as $h_{si} = a_i^{1/2} g_{si}$ where, as before, $a_i$ is the
eigenvalue of the $i$-th eigenvector $g_{si}$. An orthonormal rotation matrix G ($\sum_j G_{ij} G_{kj}$ is 1 if i = k and 0
otherwise, det(G) = +1) is introduced and the factors are rotated to obtain rotated coordinates for the
samples. Thus, to obtain a k-dimensional representation of the locations of the samples, the following
operations are performed:

1. One calculates the correlation matrix $C_{ss'}$ or the covariance matrix $S_{ss'}$, where s and s'
label individual samples.

2. One calculates the eigenvalues $a_i$ and eigenvectors $g_{si}$ for the matrix, with $a_1 > a_2 > \ldots > a_n$.

3. Unrotated factor loadings $h_{si} = a_i^{1/2} g_{si}$ are defined.

4. The first k factor loadings and an orthonormal rotation matrix G are selected. The $j$-th
coordinate of sample s in the rotated space is $\sum_i h_{si} G_{ij}$.

The rotation matrix G may be optimized according to standard criteria. See, for example,
Mardia, Kent, and Bibby, Ch. 9.6 on Varimax rotation, supra. The rotated axes represent factors that
influence the observed measurements for the samples.

In implementing the methods of the present invention, these operations may be sequentially
combined in any of several ways according to the intended display, i.e., the nature of the relatedness that
is intended to be shown.

Also, the information from the principal factors can be used to help filter the experimental noise
from the correlation functions. For example, it is possible to select a cut-off principal factor j < n, then
compute distances and correlations between samples based on their representation in the j-dimensional
principal factor space.
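
The factor loadings, the optional rotation, and the noise-filtered distances can be sketched together. The sketch assumes a symmetric, positive semi-definite matrix, uses the identity as the default rotation (no optimization such as Varimax is attempted), and takes the filtered distance to be the Euclidean distance in the truncated factor space; the function names are hypothetical.

```python
import numpy as np

def factor_coordinates(C, k, G=None):
    """Coordinates of each sample on the first k (optionally rotated) factors."""
    C = np.asarray(C, dtype=float)
    a, g = np.linalg.eigh(C)
    order = np.argsort(a)[::-1][:k]              # k largest eigenvalues
    a, g = a[order], g[:, order]
    h = g * np.sqrt(np.clip(a, 0.0, None))       # h_si = a_i^(1/2) g_si
    if G is None:
        G = np.eye(k)                            # assumption: no rotation
    return h @ G                                 # coordinate j: sum_i h_si G_ij

def filtered_distances(coords):
    """Euclidean distances between samples in the reduced factor space."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Because G is orthonormal, rotating the factors changes the coordinates of the samples but not the distances between them; only the choice of the cut-off k affects the filtered distances.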

As a nonlimiting example of the computational procedures that may be employed in the present
invention, a schematic overview of procedures that may be adopted is presented in Figure 1. The
experimental results represent the sample-dependent and selection-dependent intensities obtained in an
experiment, arrayed in a measurement matrix. In the implementation shown in Figure 1, the difference
bands having various, defined, nucleotide lengths are arrayed as the columns of the matrix; they are
obtained in various experiments that are selected using different members of the sets of subsequence
pairs. The samples represent the classes of cells, or cells treated with a set of putative pharmaceutical
agents, or analogous sample sets, and are arrayed as the rows.

The values arrayed in the measurement matrix may then be subjected to correlation analysis to
provide either direct sample correlations or correlations of differences. The measurement matrix can
also be subjected to a calculation providing a vectorial distance between samples; such a sample distance
may also be obtained from the sample correlation result. The distance vector can further be subjected to
a linkage analysis to provide hierarchical clustering of the samples. Additionally, the correlated samples
may be subjected to principal component analysis providing the principal factors contributing to a state
or to a difference.

A nonlimiting example of the way in which a principal component analysis may be carried out,
using methods described herein, is presented in Figure 2. The correlation matrix or the centered inner
product matrix described above is subjected to appropriate operations to provide the principal
components and the principal factors, based on their eigenvalues and eigenvectors. Advantageously, a
reduction in the number of dimensions employed in the number of eigenstates may provide a filtering
effect, reducing the noise in the vector distances calculated.

The representations provided in the present invention find use in various applications of
genomics in the biological and medical fields. Extents of relatedness and correlations provide rapid
overviews of enzymatic reactions, metabolic pathways and physiological effects that become
distinguished when comparing states. When a pathological state is compared with a normal state, for
example in a mammal, and especially in a human, the display of distinguished pathways is instructive in
the development of therapeutic approaches and/or therapeutic agents for the treatment of the pathological
state. When a putative pharmaceutical agent is compared to a state that omits the agent, or when one
such agent is compared with another, important information is provided relating to the metabolic
reactions induced by or undergone by the agent or agents, leading to optimal choice of such agents. This
information may also provide leads to the development of novel pharmaceutical agents. If the genome
being studied is a plant genome, such as the genome of an important crop plant, analogous principles
apply.

Nucleic acid assays 

The present invention provides a method for generating a representation of the extent of
relatedness between at least two classes of cells. In this method, the cells in each class are chosen from
among cells of a given cell type, cells from a given tissue, and cells from a given organ. Generation of
nucleic acids from the cell samples of choice may be as described in the GeneCalling(TM) methodology.
See U.S. Patent No. 5,871,697. The method includes the steps of: (a) defining a plurality of pairs of
nucleotide subsequences, each pair consisting of a first subsequence and a second subsequence;
(b) isolating the nucleic acid of each class of cells and assaying for the presence of a nucleic acid
fragment with the first subsequence at one end and the second subsequence at another end and having a
length separated by the first and second subsequences, and quantitating the extent to which each
fragment is present; and (c) determining the extent of relatedness reflecting similarities or differences in
the presence and quantitation of the fragments among the classes using software algorithm programs
known in the art.

One important embodiment of this method, i.e., determining the presence of the fragments and
quantitating the amounts present, as described in step (b) above, is carried out by a process that includes
the steps as follows. First, samples of the nucleic acid from the cells of each class are digested with a
plurality of specific pairs of restriction endonucleases ("REs"). Each sample is treated by one RE pair,
where one RE of the pair targets the first subsequence described in step (a) above, and the second RE of
the pair targets the second subsequence, with each digestion providing specific restriction fragments.

Second, double stranded adapter DNA molecules are hybridized to the fragments. Each adapter
DNA molecule comprises: (i) a shorter strand, preferably having no 5' terminal phosphate, consisting of
a first and second portion, the first portion being a region at the end that is complementary to the
overhang produced by one of the REs of the given pair and a portion hybridizable to the opposite
longer strand of the adaptor, and (ii) a longer strand, preferably having no 5' terminal phosphate,
consisting of a first portion at its end complementary to the above-mentioned second portion of the
shorter strand, and an optional second portion at its 5' end comprising a unique region not hybridizable
to any sequence present in the original sample population. See U.S. Patent No. 5,871,697. The longer
strand is optionally labeled with fluorochrome 208, although any DNA labeling system that preferably
allows multiple labels to be simultaneously distinguished is usable in this invention. See, e.g., Ausubel
et al., Current Protocols in Molecular Biology, John Wiley & Sons, New York, NY, 1993.

Third, output signals from each ligated fragment are detected for each sample population so
treated. Each ligated fragment generates output signals that characterize (a) the presence of the given
subsequences corresponding to the RE pair used in a particular run, (b) the length between the two
subsequences corresponding to the two REs employed in a given run, and (c) the quantitation of the
relative amounts present of each fragment so generated in a given run.

Optionally, a nucleotide sequence database may be searched for sequences that are predicted to
produce, or alternatively, not produce, the one or more output signals generated by the nucleic acid from
the cells of each class, given the parameters described above. The analysis methods comprise first
selecting a database of DNA sequences representative of the DNA sample to be analyzed, second using
this database and a description of the experiment to derive the pattern of simulated signals, contained
in a database of simulated signals, that will be produced by DNA fragments
generated in the experiment, and third, for any particular detected signal, using the pattern or database of
simulated signals to predict the sequences in the original sample likely to cause this signal. Further
analysis methods present an easy to use user interface and permit determination of the sequence actually
causing a signal in cases where the signal may arise from multiple sequences, and perform statistical
correlations to quickly determine signals of interest in multiple samples. A sequence from a searched
database is predicted to produce the one or more output signals when that sequence has both (a) the same
length between occurrences of target nucleotide subsequences as is represented by the one or more
output signals, and (b) the same target nucleotide subsequences that are represented by said one or more
output signals, or target nucleotide subsequences that are members of the same sets of target nucleotide
subsequences represented by the one or more output signals.

A first analysis method is selecting a database of DNA sequences representative of the sample to
be analyzed. In the preferred use of this invention, the DNA sequences to be analyzed will be derived
from a tissue sample, typically a human sample examined for diagnostic or research purposes. In this
use, database selection begins with one or more publicly available databases which comprehensively
record all observed DNA sequences. Such databases are GenBank from the National Center for
Biotechnology Information (Bethesda, Md.), the EMBL Data Library at the European Bioinformatics
Institute (Hinxton Hall, UK) and databases from the National Center for Genome Research (Santa Fe,
N.Mex.). However, as any sample of a plurality of DNA sequences of any provenance can be analyzed
by the methods of this invention, any database containing entries for the sequences likely to be present in
such a sample to be analyzed is usable in the further steps of the computer methods.

A second analysis method uses the previously selected database of sequences likely to be present
in a sample and a description of an intended experiment to derive a pattern of the signals which will be
produced by DNA fragments generated in the experiment. This pattern can be stored in a computer
implementation in any convenient manner. In the following, without limitation, it is described as being
stored as a table of information. This table may be stored as individual records or by using a database
system, such as any conventionally available relational database. Alternatively, the pattern may simply
be stored as the image of the in-memory structures which represent the pattern.
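
A toy version of such a table can be built as follows. The sketch is heavily simplified and every detail is an assumption made for illustration: recognition sites are matched as plain strings on one strand only, the "length" of a fragment is taken as the distance between the start positions of two adjacent sites, cut offsets and adapter lengths are ignored, and the names `find_sites` and `simulate_signals` are hypothetical.

```python
def find_sites(sequence, site):
    """Start positions of every occurrence of a recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

def simulate_signals(database, site1, site2):
    """Map each simulated fragment length to the sequences predicted to produce it.

    database maps a sequence name to its nucleotide string.
    """
    table = {}
    for name, sequence in database.items():
        sites = sorted(
            [(p, 1) for p in find_sites(sequence, site1)]
            + [(p, 2) for p in find_sites(sequence, site2)]
        )
        # A fragment is kept only if its two ends come from different sites.
        for (p, first), (q, second) in zip(sites, sites[1:]):
            if first != second:
                table.setdefault(q - p, []).append(name)
    return table
```

Looking up a detected length in the table then returns the candidate sequences for that signal; a length that maps to several names corresponds to the case in which a signal may arise from multiple sequences.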

A second important embodiment of this method, i.e., determining the presence of the fragments
and their quantitation, as described in step (b) above, is carried out by a process that includes the steps as
follows. First, for each pair of nucleotide subsequences selected, a pair of oligonucleotide primers is
provided, the pair consisting of a first primer and a second primer, wherein the first primer is
complementary to the first subsequence and the second primer is complementary to the second
subsequence. Second, the nucleotide sequence between the first subsequence and the second
subsequence is amplified using the oligonucleotide primers to prime the amplification, thereby
providing an amplicon characterized by the subsequence pair, a length between the two subsequences
corresponding to the two primers employed in each pair and a quantitation of the extent to which each
amplicon is present. Third, output signals are generated as above for each amplicon, each output signal
characterizing (a) the subsequences of the pairs of primers, (b) the length, and (c) the quantitation.
Optionally, a nucleotide sequence database may be searched for sequences that are predicted to produce,
or alternatively, not produce, the one or more output signals generated by the nucleic acid from the cells
of each class, given the parameters described above. Analysis methods are as described above.

This invention can be applied, for example and not by way of limitation, to in vitro cell
populations or cell lines, to in vivo animal models of disease or other processes, to human samples, to
purified cell populations perhaps drawn from actual wild-type occurrences, and to tissue samples
containing mixed cell populations. The cell or tissue sources can advantageously be a plant, a single
celled animal, a multicellular animal, a bacterium, a virus, a fungus, or a yeast, etc. The animal can
advantageously be laboratory animals used in research, such as mice engineered or bred to have certain
genomes or disease conditions or tendencies.

Cells used in the invention may be obtained from a mammal, preferably a human, having or
suspected of having a diseased condition. In one embodiment, the diseased condition is a malignancy.
The in vitro cell populations or cell lines can be exposed to various exogenous factors to determine the
effect of such factors on gene expression. In a preferred embodiment, the exogenous factor is a putative
pharmaceutical agent. Cells so contacted with a putative pharmaceutical agent are treated with an
amount of the agent sufficient to effect a change in the state of those cells or with an amount of the agent
less than or equal to a predetermined upper limit of dosing concentration, prior to their being assayed.
Measures of relatedness and extent of correlation may be made between cells so contacted with putative
pharmaceutical agent and, for example, cells not so contacted.

Extent of relatedness methodology 

The present invention provides a representation of the extent of relatedness between a first class
of cells and a second class of cells. The cells in each class are chosen from among cells of a given cell
type, cells from a given tissue, and cells from a given organ, as described above. The extent of
relatedness reflects similarities or differences in the presence of pairs of nucleotide subsequences, each
pair consisting of a first subsequence and a second subsequence, in a nucleotide length separating the
first and second subsequences of the pair and in a quantitation of the extent to which each pair having the
determined length is in the classes of cells. Input information on the fragments to be analyzed is
obtained by methods of nucleic acid analysis and quantitation as described in the NUCLEIC ACID ASSAYS
section above.

The measure of relatedness is provided by calculating a distance that reflects the amplitude of a
difference vector. A difference vector is defined as a difference between a first vector and a second
vector. Herein, the first vector reflects information derived from the quantitation for each subsequence
pair obtained for the first class of cells, and correspondingly, the second vector reflects the analogous
information derived from the second class. The different elements of each vector relate to data obtained
using different subsequence pairs.
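
In its simplest reading, the amplitude of the difference vector is its Euclidean length. The short sketch below makes that assumption explicit; the function name `relatedness_distance` is hypothetical, and other distance definitions from the Distances section could be substituted.

```python
import math

def relatedness_distance(first, second):
    """Euclidean amplitude of the difference between two quantitation vectors."""
    if len(first) != len(second):
        raise ValueError("vectors must cover the same subsequence pairs")
    difference = [a - b for a, b in zip(first, second)]
    return math.sqrt(sum(x * x for x in difference))
```

Each position in the two input vectors must refer to the same subsequence pair and length, so that the element-wise subtraction compares like with like.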

In an embodiment of the representation, the extent of relatedness is related to a distance. This
distance reflects the amplitude of a difference vector that is a difference between a first vector which
reflects information derived from the quantitation for each subsequence pair obtained for the first class
and a second vector which reflects the corresponding information obtained for the second class. The
different elements of each vector relate to data obtained using different subsequence pairs.

In an additional significant embodiment, the representation includes a tree structure reflecting
the extent of relatedness, provided by generating a tree structure reflecting the relatedness between any
two classes. The branches of the tree structure reflect the difference vectors and are ramified from
nodes.

In important embodiments of the representation of the extent of relatedness, the representation is
obtained employing the methods of the invention, including the methods that have been summarized in
the paragraphs immediately above.

In additional significant embodiments of the representation of the extent of relatedness, the cells
in at least one class are obtained as described in the NUCLEIC ACID ANALYSIS section above.

Correlation analysis methodology

The invention also provides a method for generating a representation of the correlation between
a first class of cells and a second class of cells. The correlation reflects a change in the nature and
amount of nucleic acids present in the classes. In this method, the cells in each class are chosen from
among cells of a given cell type, cells from a given tissue, and cells from a given organ. The methods of
nucleic acid analysis and quantitation are as described in the NUCLEIC ACID ASSAYS section above.

Upon generation of a signal output, the cells of the first class and the cells of the second class
are correlated, and a representation of the correlation is prepared. The quantitation of
the fragments in the invention corresponds to the RE pair used in a given run and the length of each
fragment so generated, thereby providing a quantitative measure of the extent to which the nucleic acid
present in the cells in each class contains fragments having the specific subsequence pairs and the
nucleotide length between the pairs.

In a significant embodiment of the method for generating a representation of the correlation, the
correlation is related to a set of orthonormal eigenvectors, as described in the DISTANCES section above.
The elements of the basis set upon which the eigenvectors are constructed reflect particular biochemical
or physiological pathways correlated between the cells of the two classes. Each of these eigenvectors is
associated with an eigenvalue that is an integer greater than zero. After defining an upper limit of the
eigenvalues to be used, the coefficients of the basis set elements in each eigenvector whose eigenvalue is
less than or equal to this upper limit reflect the contribution of the corresponding pathway to the
biochemical or physiological differences correlated between the cells of the first class and the cells of the
second class.

In another significant embodiment of the method for generating a representation of the
correlation, the representation is a cluster diagram or a dendrogram, and includes a tree structure
reflecting the relatedness of the pathways involved in the biochemical or physiological response to a
difference between cells of the two classes. In obtaining this representation, a correlation matrix is
calculated that provides a distance determination in which the distance reflects the amplitude of a
difference vector. This vector is a difference between two vectors each of which reflects information
obtained for the response of one of the two classes to the difference between the classes, and wherein the
branches of the tree structure reflect the difference vectors and the branches are ramified from nodes.

In additional significant embodiments of the representation of the extent of correlation, the cells
in at least one class are obtained as described in the NUCLEIC ACID ANALYSIS section above.

Display means

The present invention also provides a display means displaying a representation of the extent of
relatedness between a first class of cells and a second class of cells. The cells in each class are chosen
from among cells of a given cell type, cells from a given tissue, and cells from a given organ, as
described above. The extent of relatedness reflects similarities or differences in the presence of pairs of
nucleotide subsequences, each pair consisting of a first subsequence and a second subsequence, in a
nucleotide length separating the first and second subsequences of the pair and in a quantitation of the
extent to which each pair having the determined length is in the classes of cells.

In a significant embodiment of the display means, the extent of relatedness is related to a
distance. This distance reflects the amplitude of a difference vector that is a difference between a first
vector which reflects information derived from the quantitation for each subsequence pair obtained for
the first class and a second vector which reflects the corresponding information obtained for the second
class. The different elements of each vector relate to data obtained using different subsequence pairs.

In an additional significant embodiment of the display means, the representation includes a tree
structure reflecting the relatedness between any two classes, in which the branches of the tree structure
reflect the difference vectors and the branches are ramified from nodes.

In important embodiments of the display means displaying a representation of the extent of
relatedness, the representation is obtained employing the methods of the invention, including the
methods that have been summarized in the paragraphs immediately above.

In additional significant embodiments of the display means displaying a representation of the
extent of relatedness, the cells in at least one class are obtained as described in the NUCLEIC ACID ANALYSIS
section above.

The present invention additionally provides a display means displaying a representation of the
correlation between a first class of cells and a second class of cells. The cells in each class are chosen
from among cells of a given cell type, cells from a given tissue, and cells from a given organ, as
described above. The correlation reflects differences between the first class and the second class in the
presence of pairs of nucleotide subsequences, each pair consisting of a first subsequence and a second
subsequence, in the nucleotide length separating the first and second subsequences of the pair, and in a
quantitation of the extent to which each pair having the determined length is present in the cells.

In an advantageous embodiment of this display means, the correlation is related to a set of
orthonormal eigenvectors. The elements of the basis set upon which the eigenvectors are constructed
reflect particular biochemical or physiological pathways correlated between the cells of the two classes.
Each of these eigenvectors is associated with an eigenvalue that is an integer greater than zero. After
defining an upper limit of the eigenvalues to be used, the coefficients of the basis set elements in each
eigenvector whose eigenvalue is less than or equal to this upper limit reflect the contribution of the
corresponding pathway to the biochemical or physiological differences correlated between the cells of
the first class and the cells of the second class.

In an additional advantageous embodiment of the display means displaying a representation of
the correlation, the representation is a cluster diagram or a dendrogram, and includes a tree structure
reflecting the relatedness of the pathways involved in the biochemical or physiological response to a
difference between cells of the two classes. In obtaining this representation, a correlation matrix is
calculated that provides a distance determination in which the distance reflects the amplitude of a
difference vector. This vector is a difference between two vectors each of which reflects information
obtained for the response of one of the two classes to the difference between the classes. The branches
of the tree structure reflect the difference vectors and the branches are ramified from nodes.

In important embodiments of the display means displaying a representation of the correlation,
the representation is obtained employing the methods of the invention, including the methods that have
been summarized in the paragraphs immediately above.

In important embodiments of the representation of the correlation, the representation is obtained
employing the methods of the invention, including the methods that have been summarized in the
paragraphs immediately above.




In additional significant embodiments of the display means displaying a representation of the
correlation, the cells in at least one class are obtained as described in the Nucleic Acid Analysis section
above.

Other Aspects 

In addition to providing representations of cells, the techniques described here are also useful for
providing representations of nucleic acid fragments or genes. The starting point for the analysis is the
matrix $I_{sd}$ described previously, where s labels the sample (or group of samples or distinct types of cells)
and d labels a particular measurement of the expression level of a particular gene in that class. Rather
than generating representations based on the rows of I, each representing a different sample or group of
samples, it is possible to generate representations based on the columns of I, each representing a different
nucleic acid. Hierarchical and geometrical representations of nucleic acids, based on their relative
abundance across a series of cells, can be used to infer genes that are co-expressed and are likely to have
related biological function.
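
As a minimal sketch of the column-based analysis, the following computes Pearson correlations between the columns of a small intensity matrix, so that each entry compares two nucleic acids across samples. The function names and the toy data are hypothetical, and the sketch assumes no missing values.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def column_correlations(matrix):
    """Correlate every pair of columns (nucleic acids) across the rows (samples)."""
    columns = list(zip(*matrix))  # transpose: one tuple per nucleic acid
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


# Toy matrix: 4 samples (rows) by 3 nucleic acids (columns).
intensities = [
    [1.0, 2.0, 9.0],
    [2.0, 4.0, 7.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 1.0],
]
corr = column_correlations(intensities)
print(round(corr[0][1], 4))  # columns 0 and 1 rise together
print(round(corr[0][2], 4))  # column 2 falls as column 0 rises
```

Columns with a correlation near 1 would be candidates for co-expressed genes in the sense described above.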

Other Embodiments 

The data matrix of intensities I can be described more generally as a representation in which
each row corresponds to a particular biological sample or group of samples, and each column
corresponds to a particular nucleic acid molecule or class of molecules whose quantities are measured in
each of the biological states.

In addition to the differential-display methods described to provide measurements of nucleic acid
quantities, other methods for obtaining measurements of the nucleic acids present in a cell are available.
These include restriction fragment length polymorphism, amplification fragment length polymorphism,
EST sequencing, serial analysis of gene expression, hybridization to oligonucleotide probes, and other
methods known in the art. Other methods, such as quantification by TaqMan or Northern blots, are also
used. All of these methods generate data sets that can be analyzed according to the methods described
here. The measurements $I_{sd}$ for each biological state and nucleic acid can correspond to absolute
concentrations, concentrations relative to a standard (either ratio or numeric difference), or other
convenient measures.
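
The two relative measures mentioned above, ratio and numeric difference against a standard, can be sketched as follows. The function name, the `mode` argument, and the values are hypothetical illustrations.

```python
def relative_measure(value, standard, mode="ratio"):
    """Express a measurement relative to a standard, as a ratio or a difference."""
    if mode == "ratio":
        return value / standard
    if mode == "difference":
        return value - standard
    raise ValueError("mode must be 'ratio' or 'difference'")


# Hypothetical absolute measurements for one biological state.
row = [120.0, 80.0, 200.0]
standard = 100.0
ratios = [relative_measure(v, standard) for v in row]
differences = [relative_measure(v, standard, "difference") for v in row]
print(ratios)
print(differences)
```

Whichever measure is chosen, it should be applied uniformly to every entry of the matrix before correlations are computed.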

The methods of the invention include analysis of populations ranging from 5, 10, 25, 50, 100,
1000, 10,000 or 100,000 or more members.
EXAMPLE

Male Sprague-Dawley rats (Harlan Sprague Dawley, Inc., Indianapolis, Indiana) of 10-14 weeks
of age were gavage-fed and dosed once a day for three days with the following drugs, dissolved in sterile
water, at the following levels:

phenobarbital 3.81 mg/kg/day

gabapentin 34.29 mg/kg/day

vigabatrin 150 mg/kg/day

paraldehyde 77.08 mg/kg/day.

These dosages correspond to the ED100 (the upper limit of the effective dose for humans)
adjusted for the difference in metabolic rate between rats and humans. Three rats were used for each
drug treatment, and an additional three rats to match each drug were treated with sterile water to serve as
a control.

Rats were sacrificed 24 hours after the final dose and their brains were harvested. Collection of
mRNA, synthesis of cDNA, and differential display protocols were carried out according to methods
described in U.S. Patent No. 5,871,697 and Shimkets et al. 1999 (Nature Biotechnology 17:798-803).

The following steps were followed to analyze the differential display pattern: 

1. The intensities at each fragment length x for each of the three animals treated with the same drug
were combined into a single average $I_a(x)$, where the subscript a labels the drug. The standard deviation
$s_a(x)$ was also computed for the measurements from the individual animals treated with the drug.

2. The averages $I_a(x)$ and standard deviations $s_a(x)$ for each drug were compared with the
average $I_c(x)$ and standard deviation $s_c(x)$ for the sterile water control treatment. A difference at length
x was marked if

$$\left| \ln\left[ I_a(x) / I_c(x) \right] \right| > \ln(1.5) \qquad (5)$$

and if the significance was smaller than 0.15 for a two-tailed t-test with

$$t = \frac{I_a(x) - I_c(x)}{\left[ s_a(x)^2 + s_c(x)^2 \right]^{1/2}} \qquad (6)$$

and infinite degrees of freedom. The difference intensities marked according to this procedure
may then be inspected by eye and visually significant differences may be retained.

3. For each of the differences d, defined by a restriction enzyme pair r and a position x, the
intensity $I_a(x) = I_{ad}$ was determined for each of the drug treatments, whether or not that particular
treatment has a difference compared to the control.
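
Steps 1 and 2 can be sketched in a few lines. The sketch makes several assumptions that the text does not settle: the standard deviation is taken to be the sample standard deviation, the t statistic with infinite degrees of freedom is evaluated against the standard normal distribution, and the function names and replicate values are hypothetical.

```python
import math
from statistics import mean, stdev


def two_tailed_p_normal(t):
    """Two-tailed p-value for a t statistic with infinite degrees of freedom."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def is_marked(drug, control, fold=1.5, alpha=0.15):
    """Apply the fold-change test (5) and the significance test (6) at one length x."""
    i_a, s_a = mean(drug), stdev(drug)        # step 1: average and spread for the drug
    i_c, s_c = mean(control), stdev(control)  # same for the water control
    if abs(math.log(i_a / i_c)) <= math.log(fold):
        return False
    # Assumes at least one of the two groups shows some spread (non-zero denominator).
    t = (i_a - i_c) / math.sqrt(s_a ** 2 + s_c ** 2)
    return two_tailed_p_normal(t) < alpha


# Hypothetical replicate intensities at one fragment length.
print(is_marked([200.0, 220.0, 210.0], [100.0, 110.0, 90.0]))  # True
print(is_marked([120.0, 100.0, 110.0], [100.0, 110.0, 90.0]))  # False
```

The second call fails the fold-change test because the ratio of the averages is only 1.1, so the significance test is never reached.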

In this example, the final data matrix $I_{sd}$ has 8 rows: 1 row for each of the 4 drugs, and 1 row for
each of 4 replicates of the water control data. The matrix has as many columns as the number of 
differences detected in the differential display pattern. 

The Pearson correlation coefficient $C_{ss'}$ between the 8 classes of samples (4 drugs, 4 water
controls) was determined using methods provided in the Detailed Description of the Invention. If a data
element for a particular difference was missing for a particular treatment, that difference did not
contribute to the correlation coefficient. The correlations are shown in Table 1 below, with the
standard deviation within a drug shown as the diagonal elements.

Table 1.

(A question mark stands for a digit that is illegible in the source.)

               vigabatrin  phenobarbital  gabapentin  paraldehyde  water_1   water_2   water_3   water_4
vigabatrin     261.36??    0.9982         0.99??      0.9728       0.6548    0.6573    0.3674    0.37??
phenobarbital  0.9982      253.4469       0.9973      0.9325       0.5678    0.4724    0.?569    0.?60?
gabapentin     0.99??      0.9973         556.3114    0.9922       0.6?77    0.5057    0.3400    0.2704
paraldehyde    0.9728      0.9325         0.9922      423.1916     0.6485    0.5307    0.5952    0.5702
water_1        0.6548      0.5678         0.6?77      0.6485       59.4573   0.971?    0.989?    0.9735
water_2        0.6573      0.4724         0.5057      0.5307       0.971?    62.3568   0.99?0    0.98??
water_3        0.3674      0.?569         0.3400      0.5952       0.989?    0.99?0    107.2630  0.997?
water_4        0.37??      0.?60?         0.2704      0.5702       0.9735    0.98??    0.997?    123.0743
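
The treatment of missing elements in the correlation can be sketched as pairwise deletion: a difference contributes to the coefficient for a pair of treatments only if both treatments have a value for it. Encoding a missing element as `None` and the function name are assumptions made for this illustration.

```python
import math


def pearson_skip_missing(u, v):
    """Pearson correlation over the positions where both sequences have a value."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two shared measurements")
    mean_u = sum(a for a, _ in pairs) / n
    mean_v = sum(b for _, b in pairs) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in pairs)
    var_u = sum((a - mean_u) ** 2 for a, _ in pairs)
    var_v = sum((b - mean_v) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_u * var_v)


# Hypothetical rows of the data matrix; the third difference is missing in one row.
row_a = [1.0, 2.0, None, 4.0]
row_b = [2.0, 4.0, 100.0, 8.0]
print(round(pearson_skip_missing(row_a, row_b), 6))  # 1.0
```

Because each pair of rows may use a different subset of columns, the resulting matrix is not guaranteed to be positive semidefinite.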



Next the pairwise Pearson distance was calculated as described previously. The distance matrix
is shown in Table 2 below.

Table 2.

(A question mark stands for a digit or entry that is illegible in the source.)

               vigabatrin  phenobarbital  gabapentin  paraldehyde  water_1  water_2  water_3  water_4
vigabatrin     0.0000      0.0597         ?           0.2333       0.8309   0.8???   ?        ?
phenobarbital  0.0597      0.0000         0.0741      0.3675       0.9297   1.0?7?   0.823?   ?
gabapentin     ?           0.0741         0.0000      0.1245       ?        ?        1.1489   1.2080
paraldehyde    0.2333      0.3675         0.1245      0.0000       0.838?   0.9688   0.8997   0.92??
water_1        0.8309      0.9297         ?           0.838?       0.0000   0.237?   0.14??   0.2302
water_2        0.8???      1.0?7?         ?           0.9688       0.237?   0.0000   0.0895   0.??09
water_3        ?           0.823?         1.1489      0.8997       0.14??   0.0895   0.0000   0.0663
water_4        ?           ?              1.2080      0.92??       0.2302   0.??09   0.0663   0.0000

The distances were then used as input to a nearest-neighbor clustering algorithm. The resulting
clusters, using sterile H2O as an outgroup, are shown in Fig. 1A, in which the horizontal distances are
proportional to the pairwise Pearson distance between clusters.
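
The distance and clustering steps can be sketched together. Two assumptions are made here. First, the Pearson distance is taken to be d = sqrt(2(1 - C)); this formula is not stated in this section, but it reproduces legible table entries (for example, C = 0.9728 gives d of about 0.2333). Second, "nearest-neighbor clustering" is read as single-linkage agglomeration. Function names are hypothetical, and rooting the tree with an outgroup is not covered.

```python
import math
from itertools import combinations


def pearson_distance(correlation):
    """Assumed conversion from correlation to distance: d = sqrt(2 * (1 - C))."""
    return math.sqrt(2.0 * (1.0 - correlation))


def single_linkage(labels, distance):
    """Repeatedly merge the two closest clusters and return the merge history.

    distance maps frozenset({a, b}) to the distance between labels a and b.
    """
    clusters = [frozenset([label]) for label in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for left, right in combinations(clusters, 2):
            # Nearest-neighbor rule: use the smallest distance between members.
            d = min(distance[frozenset((a, b))] for a in left for b in right)
            if best is None or d < best[0]:
                best = (d, left, right)
        d, left, right = best
        merges.append((sorted(left), sorted(right), d))
        clusters = [c for c in clusters if c not in (left, right)] + [left | right]
    return merges


# Correlations for four of the eight classes, copied from legible table entries.
correlations = {
    ("vigabatrin", "phenobarbital"): 0.9982,
    ("vigabatrin", "paraldehyde"): 0.9728,
    ("phenobarbital", "paraldehyde"): 0.9325,
    ("vigabatrin", "water_1"): 0.6548,
    ("phenobarbital", "water_1"): 0.5678,
    ("paraldehyde", "water_1"): 0.6485,
}
labels = ["vigabatrin", "phenobarbital", "paraldehyde", "water_1"]
distance = {frozenset(pair): pearson_distance(c) for pair, c in correlations.items()}
merges = single_linkage(labels, distance)
for left, right, d in merges:
    print(left, right, round(d, 4))
```

In this subset the three drugs join one another before the water control joins, which matches the separation of drugs and controls described in the example.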

The correlation matrix $C_{ss'}$ also served as the starting point for principal factor analysis. First,
principal components were calculated using the inner product matrix from multidimensional scaling,

$$B = HCH$$

where $C$ is the correlation matrix and $H$ is the centering matrix. The k-th principal component is
then the k-th eigenvector of $B$, normalized to unit length and ordered by decreasing eigenvalue, and the
k-th principal factor was obtained by scaling the eigenvector. Projections of the treatments and
controls onto principal factors are shown in Table 3 below.
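
A minimal sketch of the first principal factor follows. It forms B = HCH by double centering and finds the leading eigenvector by power iteration. The scaling factor is not legible in the text, so scaling by the square root of the eigenvalue is an assumption; it appears consistent with Table 3, where the squared projections on factor 3 sum to about 0.09 against an eigenvalue of 0.093. The function names and the toy correlation matrix are hypothetical.

```python
import math


def double_center(c):
    """Return B = H C H, where H = I - (1/n) * ones is the centering matrix."""
    n = len(c)
    row_mean = [sum(row) / n for row in c]
    col_mean = [sum(c[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[c[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
            for i in range(n)]


def leading_factor(b, iterations=500):
    """Power iteration for the dominant eigenpair, assumed to have a positive eigenvalue."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # non-constant start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    if eigenvalue <= 0.0:
        raise ValueError("dominant eigenvalue is not positive")
    # Assumed scaling: unit eigenvector times the square root of its eigenvalue.
    return eigenvalue, [math.sqrt(eigenvalue) * x for x in v]


# Toy correlation matrix: two similar "drug" rows and two similar "control" rows.
c = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
eigenvalue, factor = leading_factor(double_center(c))
print(round(eigenvalue, 4))
print([round(x, 4) for x in factor])
```

In the toy example the first two rows and the last two rows receive projections of opposite sign on the leading factor. The overall sign of an eigenvector is arbitrary, so only the relative signs are meaningful.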

Table 3.

(A question mark stands for a digit or entry that is illegible in the source.)

factor          1       2       3       4       5       6       7       8
eigenvalue      1.841   0.347   0.093   0.036   0.00?   0.000  -0.036  -0.389
vigabatrin     -0.5?3   ?      -0.099   0.083  -0.004   0.000   0.000   0.000
phenobarbital  -0.397   0.3?7  -0.132  -0.038  -0.0??   0.000   0.000   0.000
gabapentin     -0.580  -0.157   0.028  -0.094  -0.004   0.000   0.000   0.000
paraldehyde    -0.404   ?       0.193   0.057   0.034   0.000   0.000   0.000
water_1         0.368   ?       0.109   0.0?7  -0.06?   0.000   0.000   0.000
water_2         0.422  -0.300  -0.091  -0.0?7   0.03?   0.000   0.000   0.000
water_3         0.542   0.156   0.046  -0.090   0.012   0.000   0.000   0.000
water_4         0.561   0.190  -0.05?   0.082   0.001   0.00?   0.000   0.000

The components are ordered from 1 (most informative) to 8 (least informative). The negative
eigenvalues arise from the method used to account for missing data. If missing data had been handled in
an alternate manner, for example if a missing element had been set to the average value or if the analysis
were restricted to differences for which no data was missing, the eigenvalues would all be non-negative.
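
The two alternatives named above can be sketched as follows. With either one, every correlation is computed from the same complete set of columns, so the correlation matrix is a Gram matrix of standardized rows and cannot have negative eigenvalues. Encoding a missing element as `None`, taking "the average value" to mean the column average, and the function names are assumptions made for this illustration.

```python
def impute_column_means(matrix):
    """Replace each missing element with the mean of the values present in its column."""
    columns = list(zip(*matrix))
    means = []
    for column in columns:
        present = [x for x in column if x is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if x is None else x for j, x in enumerate(row)] for row in matrix]


def complete_columns(matrix):
    """Keep only the columns (differences) for which no row has a missing element."""
    keep = [j for j, column in enumerate(zip(*matrix)) if None not in column]
    return [[row[j] for j in keep] for row in matrix]


# Hypothetical data matrix: rows are treatments, columns are differences.
data = [
    [1.0, None, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 9.0],
]
print(impute_column_means(data))  # [[1.0, 5.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
print(complete_columns(data))     # [[1.0], [2.0], [3.0]]
```

Restricting to complete columns discards data, while imputation keeps every column at the cost of pulling imputed entries toward the average.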

In Fig. 4, the treatments are displayed by projection onto principal factors. Factor 1
discriminates between drugs, where it has a negative value, and controls, where it has a positive value.
Factor 2 discriminates between the drug treatments.



EQUIVALENTS

From the foregoing detailed description of the specific embodiments of the invention, it should
be apparent that unique methods for representing the extent of relatedness between cells, cell lines,
tissues, organs, or expressed sequences based on a genomic analysis of gene expression have been
described. Although particular embodiments have been disclosed herein in detail, this has been done by
way of example for purposes of illustration only, and is not intended to be limiting with respect to the
scope of the appended claims which follow. In particular, it is contemplated by the inventor that various
substitutions, alterations, and modifications may be made to the invention without departing from the
spirit and scope of the invention as defined by the claims. For instance, the choice of source material,
subsequences used, or software algorithm used is believed to be a matter of routine for a person of
ordinary skill in the art with knowledge of the embodiments described herein.

CLAIMS

I claim: 



1. A method for generating a representation of the extent of relatedness between at least two
classes of cells, wherein the cells in each class are chosen from the group consisting of cells of a given
cell type, cells from a given tissue, and cells from a given organ, the method comprising the steps of:

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present; and

c) determining the extent of relatedness reflecting similarities or differences in the presence and
quantitation of the fragments among the classes.

2. The method described in claim 1 wherein the determining of the presence and quantitation of
the fragments described in step b) is carried out by a process comprising the steps of:

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

ii) generating output signals from each ligated fragment for each of the pairs of restriction
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction
endonucleases, (b) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment
corresponding to the pair and the length; and

iii) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.

3. The method described in claim 1 wherein the determining of the presence of the fragments and
the quantitation of the fragments, described in step b) is carried out by a process comprising the steps of:

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers,
consisting of a first primer and a second primer, wherein the first primer is complementary to the first
subsequence and the second primer is complementary to the second subsequence;

ii) amplifying the nucleotide sequence between the first subsequence and the second
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon
characterized by the subsequence pair, a length between the two subsequences corresponding to the two
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and

iii) generating output signals for each amplicon, each output signal characterizing (a) the
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and

iv) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs.

4. The method described in claim 1 wherein the extent of relatedness in step c) is provided by
calculating a distance wherein the distance reflects the amplitude of a difference vector that is a
difference between a first vector which reflects information derived from the quantitation for each
subsequence pair obtained for the first class and a second vector which reflects information derived from
the quantitation for each subsequence pair obtained for the second class, wherein different elements of
each vector relate to data obtained using different pairs.

5. The method described in claim 1 wherein the extent of relatedness in step c) is provided by
generating a tree structure reflecting the relatedness between any two classes, wherein the branches of
the tree structure reflect the difference vectors and the branches are ramified from nodes.

6. The method described in claim 1 wherein the cells in at least one class are cancer cells.

7. The method described in claim 1 wherein the cells in at least one class have been contacted
with a putative pharmaceutical agent.

8. A method for generating a representation of the correlation between a plurality of classes of
cells wherein the cells in each class are chosen from the group consisting of cells of a given cell type,
cells from a given tissue, and cells from a given organ, the correlation reflecting a change in the nature
and amount of nucleic acids present in the classes, the method comprising the steps of:

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining a difference between the classes;

c) evaluating the correlation between the cells of the classes; and

d) preparing a representation of the correlation.

9. The method described in claim 8 wherein the determining of the presence and quantitation of
the fragments described in step b) is carried out by a process comprising the steps of:

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

ii) generating output signals from each ligated fragment for each of the pairs of restriction
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction
endonucleases, (b) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment
corresponding to the pair and the length; and

iii) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.

10. The method described in claim 8 wherein the determining of the presence of the fragments
and the quantitation of the fragments, described in step b) is carried out by a process comprising the steps
of:

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers,
consisting of a first primer and a second primer, wherein the first primer is complementary to the first
subsequence and the second primer is complementary to the second subsequence;

ii) amplifying the nucleotide sequence between the first subsequence and the second
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon
characterized by the subsequence pair, a length between the two subsequences corresponding to the two
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and

iii) generating output signals for each amplicon, each output signal characterizing (a) the
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and

iv) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs.

11. The method described in claim 8 wherein the correlation in step d) is related to a set of
orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed
reflecting particular biochemical or physiological pathways correlated between the cells of the two
classes, each eigenvector having an eigenvalue that is an integer greater than zero, the coefficients of the
basis set elements in each eigenvector whose eigenvalue is less than or equal to a particular integer that
is an upper limit of the eigenvalues used reflecting the contribution of the corresponding pathway to the
biochemical or physiological differences correlated between the cells of the first class and the cells of the
second class.

12. The method described in claim 8 wherein the representation is a cluster diagram or a
dendrogram, and includes a tree structure reflecting the relatedness of the pathways involved in the
biochemical or physiological response to a difference between cells of the two classes, wherein a
correlation matrix provides a distance determination wherein the distance reflects the amplitude of a
difference vector that is a difference between two vectors each of which reflects information obtained for
the response of one of the two classes to the difference, and wherein the branches of the tree structure
reflect the difference vectors and the branches are ramified from nodes.

13. The method described in claim 8 wherein the cells in at least one class are cancer cells. 

14. The method described in claim 8 wherein the cells in at least one class have been contacted
with a putative pharmaceutical agent, and the method comprises the steps of:

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit
of dosing concentration;

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining an effect of the agent;

d) evaluating the correlation between the effect of the agent on the cells of the first class and the
effect of the agent on the cells of another class; and

e) preparing a representation of the correlation.

15. A display means displaying a representation of the extent of relatedness between at least two
classes of cells, wherein the cells in each class are chosen from the group consisting of cells of a given
cell type, cells from a given tissue, and cells from a given organ, the extent of relatedness reflecting, in
the nucleic acids of the classes of cells, similarities or differences in the presence of pairs of nucleotide
subsequences, each pair consisting of a first subsequence and a second subsequence, a nucleotide length
separating the first and second subsequences of the pair and a quantitation of the extent to which each
pair having the determined length is in the classes of cells.

16. The display means described in claim 15 wherein the extent of relatedness is related to a
distance wherein the distance reflects the amplitude of a difference vector that is a difference between a
first vector which reflects information derived from the quantitation for each subsequence pair obtained
for the first class and a second vector which reflects information derived from the quantitation for each
subsequence pair obtained for the second class, wherein different elements of each vector relate to data
obtained using different pairs.

17. The display means described in claim 15 wherein the representation includes a tree structure
reflecting the relatedness between any two classes, and wherein the branches of the tree structure reflect
the difference vectors and the branches are ramified from nodes.

18. The display means described in claim 15 wherein the extent of relatedness is obtained by a
process comprising the steps of:

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present; and

c) determining the extent of relatedness reflecting similarities or differences in the presence and
quantitation of the fragments among the classes.

19. The display means described in claim 18 wherein the determining of the presence and
quantitation of the fragments described in step b) is carried out by a process comprising the steps of:

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

ii) generating output signals from each ligated fragment for each of the pairs of restriction
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction
endonucleases, (b) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment
corresponding to the pair and the length; and

iii) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (a) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.

:0. The display means described in claim 18 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step b) is carried out by a process 
comprising the steps of: 

1) for each pair of nucleotide subsequences pro\ iding a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementar> to the first 
sub.sequence and the .second primer is complementar\- to the second subsequence: 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to 
the two primers employed in each pair and a quantitation of the extent to which each amplicon is 
present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

21. The display means described in claim 15 wherein the cells in at least one class are cancer 
cells. 

22. The display means described in claim 15 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent. 

23. A display means displaying a representation of the correlation between a plurality of classes 
of cells, wherein the cells in each class are chosen from the group consisting of cells of a given cell type, 
cells from a given tissue, and cells from a given organ, the correlation reflecting, in the nucleic acids of 
the classes of cells, differences in the presence of a pair of nucleotide subsequences, each pair consisting 
of a first subsequence and a second subsequence and the nucleotide length separating the first and second 
subsequences of the pair, and a quantitation of the extent to which each pair having the determined 
length is present in the cells, between the classes. 

24. The display means described in claim 23 wherein the correlation is related to a set of 
orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed 
reflecting particular biochemical or physiological pathways correlated between the cells of the two 
classes, each eigenvector having an eigenvalue that is an integer greater than zero, the coefficients of the 
basis set elements in each eigenvector whose eigenvalue is less than a particular integer that is chosen to 
be an upper limit of the eigenvalues reflecting the contribution of the corresponding pathway to the 
biochemical or physiological differences correlated between the cells of the first class and the cells of the 
second class. 

25. The display means described in claim 23 wherein the representation is a cluster diagram or a 
dendrogram and includes a tree structure reflecting the relatedness of the pathways involved in the 
biochemical or physiological difference between cells of the two classes, wherein a correlation matrix 
provides a distance determination wherein the distance reflects the amplitude of a difference vector that 
is a difference between two vectors each of which reflects information obtained for the difference 
between the classes, and wherein the branches of the tree structure reflect the difference vectors and the 
branches are ramified from nodes. 

26. The display means described in claim 23 wherein the correlation is obtained by a method 
comprising the steps of 

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining a difference between classes; 

c) evaluating the correlation between the cells of one class and the cells of a second class based 
on the difference between them; and 

d) preparing a representation of the correlation. 

27. The display means described in claim 23 wherein the determining of the presence and 
quantitation of the fragments described in step b) is carried out by a process comprising the steps of: 

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

ii) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction 
endonucleases (b) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment 
corresponding to the pair and the length; and 

iii) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

28. The display means described in claim 26 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step b) is carried out by a process 
comprising the steps of: 

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two 
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

29. The display means described in claim 23 wherein the cells in at least one class are cancer 
cells. 

30. The display means described in claim 23 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent, and the correlation is obtained by a method comprising the 
steps of 

a) contacting the cells of at least one class with an amount of the agent sufficient to effect a 
change in the state of those cells or with an amount of the agent less than or equal to a predetermined 
upper limit of dosing concentration; 

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining an effect of the agent; 

d) evaluating the correlation between the effect of the agent between the cells of at least one 
class contacted with the agent and the cells of another class; and 

e) preparing a representation of the correlation. 

31. A representation of the extent of relatedness between at least two classes of cells, wherein the 
cells in each class are chosen from the group consisting of cells of a given cell type, cells from a given 
tissue, and cells from a given organ, the extent of relatedness reflecting, in the nucleic acids of the 
classes of cells, similarities or differences in the presence of pairs of nucleotide subsequences, each pair 
consisting of a first subsequence and a second subsequence, a nucleotide length separating the first and 
second subsequences of the pair and a quantitation of the extent to which each pair having the 
determined length is in the classes of cells. 

32. The representation described in claim 31 wherein the extent of relatedness is related to a 
distance wherein the distance reflects the amplitude of a difference vector that is a difference between a 
first vector which reflects information derived from the quantitation for each subsequence pair obtained 
for the first class and a second vector which reflects information derived from the quantitation for each 
subsequence pair obtained for the second class, wherein different elements of each vector relate to data 
obtained using different pairs. 
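
The distance recited in this claim can be sketched as follows, assuming that the "amplitude" of the difference vector is taken to be its Euclidean norm (an interpretation, not a statement from the claims). Each vector holds one quantitation per subsequence pair, in the same order for both classes. The function names and example values are hypothetical.

```python
import math


def difference_vector(first, second):
    """Element-wise difference of two per-pair quantitation vectors."""
    if len(first) != len(second):
        raise ValueError("both classes must be measured with the same pairs")
    return [a - b for a, b in zip(first, second)]


def relatedness_distance(first, second):
    """Euclidean norm of the difference vector (assumed meaning of 'amplitude')."""
    return math.sqrt(sum(d * d for d in difference_vector(first, second)))


# Quantitations for three subsequence pairs in two classes of cells.
class_one = [3.0, 0.0, 1.0]
class_two = [0.0, 4.0, 1.0]
distance = relatedness_distance(class_one, class_two)
```

A smaller distance indicates more closely related classes under this interpretation; other norms could be substituted without changing the structure of the calculation.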

33. The representation described in claim 31 wherein the representation includes a tree structure 
reflecting the relatedness between any two classes, and wherein the branches of the tree structure reflect 
the difference vectors and the branches are ramified from nodes. 

34. The representation described in claim 31 wherein the extent of relatedness is obtained by a 
process comprising the steps of 

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present; and 

c) determining the extent of relatedness reflecting similarities or differences in the presence and 
quantitation of the fragments among the classes. 

35. The representation described in claim 34 wherein the determining of the presence and 
quantitation of the fragments described in step b) is carried out by a process comprising the steps of: 

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

ii) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction 
endonucleases (b) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment 
corresponding to the pair and the length; and 

iii) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

36. The representation described in claim 34 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step b) is carried out by a process 
comprising the steps of: 



i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to 
the two primers employed in each pair and a quantitation of the extent to which each amplicon is 
present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

37. The representation described in claim 31 wherein the cells in at least one class are cancer 
cells. 

38. The representation described in claim 31 wherein the cells in a class have been contacted 
with a putative pharmaceutical agent. 

38. A representation of the correlation between a plurality of classes of cells, wherein the cells in 
each class are chosen from the group consisting of cells of a given cell type, cells from a given tissue, 
and cells from a given organ, the correlation reflecting, in the nucleic acids of the classes of cells, 
differences in the presence of a pair of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence and the nucleotide length separating the first and second 
subsequences of the pair, and a quantitation of the extent to which each pair having the determined 
length is present in the cells, between the classes. 

39. The representation described in claim 38 wherein the correlation is related to a set of 
orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed 
reflecting particular biochemical or physiological pathways correlated between the cells of the two 
classes, each eigenvector having an eigenvalue that is an integer greater than zero, the coefficients of the 
basis set elements in each eigenvector whose eigenvalue is less than a particular integer that is chosen to 
be an upper limit of the eigenvalues reflecting the contribution of the corresponding pathway to the 
biochemical or physiological differences correlated between the cells of the first class and the cells of the 
second class. 

40. The representation described in claim 38 wherein the representation is a cluster diagram or a 
dendrogram and includes a tree structure reflecting the relatedness of the pathways involved in the 
biochemical or physiological differences between cells of the two classes, wherein a correlation matrix 
provides a distance determination wherein the distance reflects the amplitude of a difference vector that 
is a difference between two vectors each of which reflects information obtained from one of the classes, 
and wherein the branches of the tree structure reflect the difference vectors and the branches are 
ramified from nodes. 
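
A tree structure of the kind recited here can be built from a matrix of pairwise distances by agglomerative clustering. The sketch below uses single linkage, which is one of several possible linkage rules and is an assumption rather than something the claims specify. The tree is returned as nested tuples `(left, right, merge_distance)`, each tuple standing for a node from which two branches ramify; the function name and the example distances are hypothetical.

```python
def single_linkage_tree(labels, dist):
    """Build a dendrogram as nested tuples from a symmetric distance matrix.

    labels: list of class names; dist[i][j]: distance between classes i and j.
    """
    # Each cluster id maps to (subtree, list of member indices).
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Single linkage: distance between the closest members.
                d = min(dist[i][j]
                        for i in clusters[a][1] for j in clusters[b][1])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        # The new node records its two branches and the merge distance.
        clusters[next_id] = ((tree_a, tree_b, d), members_a + members_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


labels = ["A", "B", "C"]
dist = [[0, 1, 4],
        [1, 0, 3],
        [4, 3, 0]]
tree = single_linkage_tree(labels, dist)
```

In this example classes A and B join first at distance 1, and C joins that node at distance 3. The distances themselves could come from the norm of a difference vector or from a correlation matrix, for instance as one minus the correlation coefficient, which is again only one possible convention.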

41. The representation described in claim 38 wherein the correlation is obtained by a method 
comprising the steps of 

a) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

b) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining a difference between classes; 

c) evaluating the correlation between the cells of one class and the cells of a second class based 
on the difference between them; and 

d) preparing a representation of the correlation. 
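
Step c) above leaves the correlation measure open. As a minimal sketch, assuming a Pearson correlation coefficient computed over the per-pair quantitations of each class (a choice made here for illustration only), a correlation matrix across classes could be formed as follows. The names `pearson` and `correlation_matrix` and the example profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (spread_x * spread_y)


def correlation_matrix(profiles):
    """Return (sorted class names, matrix of pairwise correlations)."""
    names = sorted(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


# Quantitations for three subsequence pairs in three classes of cells.
profiles = {
    "control": [1.0, 2.0, 3.0],
    "treated": [2.0, 4.0, 6.0],
    "tumor": [3.0, 2.0, 1.0],
}
names, matrix = correlation_matrix(profiles)
```

The resulting matrix is symmetric with ones on the diagonal; step d) would then render it, for example as a table or as input to a tree-building procedure.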

42. The representation described in claim 41 wherein the determining of the presence and 
quantitation of the fragments described in step b) is carried out by a process comprising the steps of: 

i) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (a) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(b) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

ii) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (a) the subsequences of the pairs of restriction 
endonucleases (b) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (c) the quantitation of the fragment 
corresponding to the pair and the length; and 

iii) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

43. The representation described in claim 41 wherein the determining of the presence of the 
fragments and the quantitation of the fragments, described in step c) is carried out by a process 
comprising the steps of: 

i) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

ii) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two 
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and 

iii) generating output signals for each amplicon, each output signal characterizing (a) the 
subsequences of the pairs of primers, (b) the length, and (c) the quantitation; and 

iv) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (a) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (b) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs. 

44. The representation described in claim 38 wherein the cells in at least one class are cancer 
cells. 

45. The representation described in claim 38 wherein the cells in at least one class have been 
contacted with a putative pharmaceutical agent, and the correlation is obtained by a method comprising 
the steps of 

a) contacting the cells of at least one class with an amount of the agent sufficient to effect a 
change in the state of those cells or with an amount of the agent less than or equal to a predetermined 
upper limit of dosing concentration; 



b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; 

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining an effect of the agent; 

d) evaluating the correlation between the effect of the agent between the cells of at least one 
class contacted with the agent and the cells of another class; and 

e) preparing a representation of the correlation. 

46. A method for generating a geometrical representation between a plurality of classes of cells 
wherein the cells in each class are chosen from the group consisting of cells of a given cell type, cells 
from a given tissue, and cells from a given organ, the representation reflecting a change in the nature and 
amount of nucleic acids present in the classes, the method comprising the steps of: 

a) in the nucleic acid of each class of cells, assessing the presence and amount of a nucleic acid 
fragment thereby defining a difference between the classes; 

b) carrying out a geometrical analysis based on the differences between the cells of the classes; 
and 

c) preparing a representation of the results of the analysis. 

47. The method described in claim 46 wherein the geometrical representation is a result obtained 
by a principal component analysis or a principal factor analysis. 
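
As an illustration of the principal component analysis named in this claim, the following sketch extracts only the first principal component of a small data matrix by power iteration on the sample covariance matrix. It assumes one row per class of cells and one column per nucleic acid fragment; the function name, the iteration count and the example data are hypothetical, and a production analysis would normally rely on an established numerical library.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, per-row scores) for the first principal component.

    rows: list of equal-length numeric lists, at least two rows.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the columns.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Deterministic start vector; power iteration fails only if it happens
    # to be exactly orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Project each centered row onto the component to get its coordinate.
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores


# Three classes of cells measured on two fragments.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, scores = first_principal_component(data)
```

The scores give each class a coordinate along the direction of greatest variance, which is the kind of geometrical placement a representation prepared in step c) could display. Further components could be obtained by deflating the covariance matrix and repeating the iteration.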

48. The method described in claim 46 wherein assessing the presence and amount of a nucleic 
acid fragment described in step a) is carried out by a process comprising the steps of: 

i) probing the nucleic acid of each class with a set of oligonucleotide probes specific for the 
fragment; and 

ii) determining the extent to which each probe binds the nucleic acid; 

thereby providing an assessment of the presence and amount of the nucleic acid fragment in the 
class. 

49. The method described in claim 46 wherein assessing the presence and amount of a nucleic 
acid fragment described in step a) is carried out by a process comprising the steps of: 

i) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; and 

ii) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining the difference between the classes. 

50. The method described in claim 49 wherein assessing the presence and quantity of a nucleic 
acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair 
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence, 
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA 
molecules to the fragments, each adapter DNA molecule comprising (1) a shorter strand having no 5' 
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and 
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and 
(2) a longer strand having a 3' end complementary to the second portion of the shorter strand, and 
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment 
is capable of generating an output signal; 

(b) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (1) the subsequences of the pairs of restriction 
endonucleases (2) the length between the two subsequences corresponding to the two restriction 
endonucleases employed in each pair of nucleases, and (3) the quantitation of the fragment 
corresponding to the pair and the length; and 

(c) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (1) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals, 

thereby providing a quantitative measure of the extent to which the nucleic acid present in the 
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length 
between the pairs. 

51. The method described in claim 49 wherein assessing the presence and quantity of a nucleic
acid fragment described in step ii) is carried out by a process comprising the steps of:

(a) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers,
consisting of a first primer and a second primer, wherein the first primer is complementary to the first
subsequence and the second primer is complementary to the second subsequence;

(b) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and

(c) generating output signals for each amplicon, each output signal characterizing (1) the
subsequences of the pairs of primers, (2) the length, and (3) the quantitation; and

(d) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (1) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs.

52. The method described in claim 46 wherein the results of the geometrical analysis are chosen
from the group consisting of eigenvalues, eigenvectors, and principal factors.

53. The method described in claim 46 wherein the results of analysis in step c) are related to a set
of orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors are constructed
reflecting particular biochemical, physiological or pharmacological components correlated between the
cells of the two classes, each eigenvector having an eigenvalue, the coefficients of the basis set elements
in each eigenvector reflecting the contribution of the corresponding biochemical, physiological or
pharmacological components to the differences between the cells of the first class and the cells of the
second class.
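
By way of illustration only, the eigenvalue and eigenvector results recited in claims 52 and 53 can be sketched as follows. The sketch builds a Pearson correlation matrix among three classes of cells and extracts its dominant eigenpair by power iteration. The intensity values, the function names, and the choice of power iteration are assumptions made for this example; the claims do not prescribe a numerical procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(profiles):
    """Symmetric matrix of pairwise correlations between classes."""
    return [[pearson(p, q) for q in profiles] for p in profiles]

def dominant_eigenpair(matrix, iterations=200):
    """Power iteration on a symmetric matrix.

    Returns (eigenvalue, unit eigenvector) for the largest eigenvalue.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigenvalue, v

# Hypothetical fragment intensities: one row per class, one column per fragment.
profiles = [
    [1.0, 2.0, 3.0, 4.0],  # class 1
    [2.0, 4.0, 6.0, 8.5],  # class 2, nearly proportional to class 1
    [4.0, 3.0, 2.0, 1.0],  # class 3, anti-correlated with class 1
]
corr = correlation_matrix(profiles)
eigenvalue, eigenvector = dominant_eigenpair(corr)
print(round(eigenvalue, 3), [round(c, 3) for c in eigenvector])
```

In this toy data set the coefficients for classes 1 and 2 share a sign and the coefficient for class 3 has the opposite sign, which is the sense in which eigenvector coefficients reflect contributions to the differences between classes.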

54. The method described in claim 46 wherein the cells in at least one class are cancer cells.

55. The method described in claim 46 wherein the cells in at least one class are contacted with a 
putative pharmaceutical agent, and the method comprises the steps of: 

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit
of dosing concentration;

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining an effect of the agent;

d) conducting a principal component analysis between the effect of the agent on the cells of the
first class and the cells of another class; and

e) preparing a representation of the results of the analysis. 

56. A display means displaying a geometrical representation between a plurality of classes of
cells wherein the cells in each class are chosen from the group consisting of cells of a given cell type,
cells from a given tissue, and cells from a given organ, the principal component analysis reflecting a
change in the nature and amount of nucleic acids present in the classes, wherein the representation is
obtained by a method comprising the steps of:

a) in the nucleic acid of each class of cells, assessing the presence and amount of a nucleic acid
fragment thereby defining a difference between the classes;

b) carrying out a principal component analysis based on the differences between the cells of the
first class and the cells of the second class; and

c) preparing the representation of the results of the analysis.

57. The display means described in claim 56 wherein the geometrical representation is a result 
obtained by a principal component analysis or a principal factor analysis. 

58. The display means described in claim 56 wherein assessing the presence and amount of a
nucleic acid fragment described in step a) comprises the steps of: 

i) probing the nucleic acid of each class with a set of oligonucleotide probes specific for the
fragment; and

ii) determining the extent to which each probe binds the nucleic acid; 

thereby providing an assessment of the presence and amount of the nucleic acid fragment in the
class.

59. The display means described in claim 56 wherein assessing the presence and amount of a
nucleic acid fragment described in step a) is carried out by a process comprising the steps of: 

i) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first 
subsequence and a second subsequence; and 

ii) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining the difference between the classes.

60. The display means described in claim 59 wherein determining the presence and quantity of a
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) digesting samples of the nucleic acid from the cells of each class with a plurality of specific
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (1) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(2) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

(b) generating output signals from each ligated fragment for each of the pairs of restriction
endonucleases, each output signal characterizing (1) the subsequences of the pairs of restriction
endonucleases (2) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (3) the quantitation of the fragment
corresponding to the pair and the length; and

(c) optionally searching a nucleotide sequence database to determine sequences that are 
predicted to produce or the absence of any sequences that are predicted to produce the one or more 
output signals produced by the nucleic acid from the cells of each class, the database comprising a 
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a 
sequence from the database being predicted to produce the one or more output signals when the sequence 
from the database has both (1) the same length between occurrences of target nucleotide subsequences as 
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are 
represented by said one or more output signals, or target nucleotide subsequences that are members of 
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.
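
The optional database search of step (c) can be sketched, under simplifying assumptions, as a lookup that predicts which known sequences would yield a fragment bounded by the two recognition subsequences at the observed length. In the sketch below the sequence names and sequences are hypothetical, each cut is treated as falling at the first base of its recognition site, and only fragments bounded by adjacent sites of the two different enzymes are reported. GAATTC and GGATCC are the recognition sites of EcoRI and BamHI.

```python
def find_all(seq, site):
    """Start positions of every occurrence of site in seq (overlaps allowed)."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

def predicted_lengths(seq, site_a, site_b):
    """Lengths of fragments with site_a at one end and site_b at the other.

    Simplification: a fragment runs between two adjacent recognition sites,
    and its length is the distance between the site start positions.
    """
    cuts = sorted(
        [(pos, "a") for pos in find_all(seq, site_a)]
        + [(pos, "b") for pos in find_all(seq, site_b)]
    )
    return [q - p for (p, e1), (q, e2) in zip(cuts, cuts[1:]) if e1 != e2]

def search(database, site_a, site_b, length, tolerance=0):
    """Names of database sequences predicted to produce the observed signal."""
    return [
        name
        for name, seq in database.items()
        if any(abs(n - length) <= tolerance
               for n in predicted_lengths(seq, site_a, site_b))
    ]

# Hypothetical database of known sequences.
database = {
    "geneX": "AAGAATTCAAAAAAGGATCCAA",   # EcoRI site, then BamHI site
    "geneY": "GGATCCTTTTGAATTCTT",       # BamHI site, then EcoRI site
    "geneZ": "GAATTCAAAAGAATTC",         # two EcoRI sites, no mixed fragment
}

hits_12 = search(database, "GAATTC", "GGATCC", 12)
hits_near_11 = search(database, "GAATTC", "GGATCC", 11, tolerance=1)
hits_none = search(database, "GAATTC", "GGATCC", 50)
print(hits_12, hits_near_11, hits_none)
```

An empty result corresponds to "the absence of any sequences that are predicted to produce" the signal, which in practice may point to a fragment not yet represented in the database.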

61. The display means described in claim 59 wherein assessing the presence and quantity of a
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers, 
consisting of a first primer and a second primer, wherein the first primer is complementary to the first 
subsequence and the second primer is complementary to the second subsequence; 

(b) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon 
characterized by the subsequence pair, a length between the two subsequences corresponding to the two
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and

(c) generating output signals for each amplicon, each output signal characterizing (1) the
subsequences of the pairs of primers, (2) the length, and (3) the quantitation; and

(d) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (1) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs.

62. The display means described in claim 56 wherein the results of the analysis are chosen from
the group consisting of eigenvalues, eigenvectors, and principal factors. 

63. The display means described in claim 56 wherein the results of the analysis in step c) are 
related to a set of orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors 
are constructed reflecting particular biochemical, physiological or pharmacological components
correlated between the cells of the two classes, each eigenvector having an eigenvalue, the coefficients of
the basis set elements in each eigenvector reflecting the contribution of the corresponding biochemical,
physiological or pharmacological components to the differences between the cells of the first class and
the cells of the second class. 

64. The display means described in claim 56 wherein the cells in at least one class are cancer
cells.

65. The display means described in claim 56 wherein the cells in at least one class have been
contacted with a putative pharmaceutical agent, and the representation is obtained by a method
comprising the steps of:

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit
of dosing concentration;

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining an effect of the agent;

d) conducting a principal component analysis between the effect of the agent on the cells of the 
first class and the cells of another class; and 

e) preparing the representation of the results of the analysis. 

66. A geometrical representation between a plurality of classes of cells wherein the cells in each
class are chosen from the group consisting of cells of a given cell type, cells from a given tissue, and
cells from a given organ, the principal component analysis reflecting a change in the nature and amount
of nucleic acids present in the classes, the representation obtained by a method comprising the steps of:

a) in the nucleic acid of each class of cells, assessing the presence and amount of a nucleic acid
fragment thereby defining a difference between the classes;

b) carrying out a principal component analysis based on the differences between the cells of the
first class and the cells of the second class; and

c) preparing the representation of the results of the analysis. 

67. The representation described in claim 66 wherein the geometrical representation is a result 
obtained by a principal component analysis or a principal factor analysis.

68. The representation described in claim 66 wherein assessing the presence and amount of a 
nucleic acid fragment described in step a) comprises the steps of: 

i) probing the nucleic acid of each class with a set of oligonucleotide probes specific for the
fragment; and

ii) determining the extent to which each probe binds the nucleic acid;

thereby providing an assessment of the presence and amount of the nucleic acid fragment in the
class.

69. The representation described in claim 66 wherein assessing the presence and amount of a
nucleic acid fragment described in step a) is carried out by a process comprising the steps of:

i) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence; and 

ii) in the nucleic acid of each class of cells determining the presence of a fragment with the first 
subsequence at one end and the second subsequence at another end and having a length separated by the 
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby 
defining the difference between the classes.

70. The representation described in claim 69 wherein determining the presence and quantity of a 
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of: 

(a) digesting samples of the nucleic acid from the cells of each class with a plurality of specific 
pairs of restriction endonucleases, each sample being treated by one pair, one nuclease of the pair
targeting the first subsequence and the second nuclease of the pair targeting the second subsequence,
each digestion providing specific restriction fragments, hybridizing double stranded adapter DNA
molecules to the fragments, each adapter DNA molecule comprising (1) a shorter strand having no 5'
terminal phosphate and consisting of a first and second portion, said first portion being at the 5' end and
being complementary to the overhang produced by one of the restriction endonucleases of the pair, and
(2) a longer strand having a 3' end complementary to the second portion of the shorter strand, and
ligating the longer strands to the fragments to produce ligated fragments, wherein each ligated fragment
is capable of generating an output signal;

(b) generating output signals from each ligated fragment for each of the pairs of restriction 
endonucleases, each output signal characterizing (1) the subsequences of the pairs of restriction
endonucleases (2) the length between the two subsequences corresponding to the two restriction
endonucleases employed in each pair of nucleases, and (3) the quantitation of the fragment
corresponding to the pair and the length; and

(c) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (1) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains fragments having the specific subsequence pairs and the nucleotide length
between the pairs.

71. The representation described in claim 69 wherein assessing the presence and quantity of a
nucleic acid fragment described in step ii) is carried out by a process comprising the steps of:

(a) for each pair of nucleotide subsequences providing a pair of oligonucleotide primers,
consisting of a first primer and a second primer, wherein the first primer is complementary to the first
subsequence and the second primer is complementary to the second subsequence;

(b) amplifying the nucleotide sequence between the first subsequence and the second 
subsequence using the oligonucleotide primers to prime the amplification, providing an amplicon
characterized by the subsequence pair, a length between the two subsequences corresponding to the two
primers employed in each pair and a quantitation of the extent to which each amplicon is present; and

(c) generating output signals for each amplicon, each output signal characterizing (1) the
subsequences of the pairs of primers, (2) the length, and (3) the quantitation; and 

(d) optionally searching a nucleotide sequence database to determine sequences that are
predicted to produce or the absence of any sequences that are predicted to produce the one or more
output signals produced by the nucleic acid from the cells of each class, the database comprising a
plurality of known nucleotide sequences of nucleic acids that may be present in the cells of each class, a
sequence from the database being predicted to produce the one or more output signals when the sequence
from the database has both (1) the same length between occurrences of target nucleotide subsequences as
is represented by the one or more output signals, and (2) the same target nucleotide subsequence as are
represented by said one or more output signals, or target nucleotide subsequences that are members of
the same sets of target nucleotide subsequences represented by the one or more output signals,

thereby providing a quantitative measure of the extent to which the nucleic acid present in the
cells in each class contains the specific subsequence pairs and the nucleotide length between the pairs.

72. The representation described in claim 66 wherein the results of the analysis are chosen from
the group consisting of eigenvalues, eigenvectors, and principal factors. 

73. The representation described in claim 66 wherein the results of the analysis in step c) are
related to a set of orthonormal eigenvectors, the elements of the basis set upon which the eigenvectors
are constructed reflecting particular biochemical, physiological or pharmacological components
correlated between the cells of the two classes, each eigenvector having an eigenvalue, the coefficients of
the basis set elements in each eigenvector reflecting the contribution of the corresponding biochemical,
physiological or pharmacological components to the differences between the cells of the first class and
the cells of the second class.

74. The representation described in claim 66 wherein the cells in at least one class are cancer
cells.

75. The representation described in claim 66 wherein the cells in at least one class have been
contacted with a putative pharmaceutical agent, and the representation is obtained by a method
comprising the steps of:

a) treating the cells of at least one class with an amount of the agent sufficient to effect a change
in the state of those cells or with an amount of the agent less than or equal to a predetermined upper limit
of dosing concentration;

b) defining a plurality of pairs of nucleotide subsequences, each pair consisting of a first
subsequence and a second subsequence;

c) in the nucleic acid of each class of cells determining the presence of a fragment with the first
subsequence at one end and the second subsequence at another end and having a length separated by the
first and second subsequences, and a quantitation of the extent to which each fragment is present, thereby
defining an effect of the agent;

d) conducting a principal component analysis between the effect of the agent on the cells of the
first class and the cells of another class; and

e) preparing the representation of the results of the analysis.

76. A method for classifying a plurality of classes of cells or components thereof hierarchically
comprising the steps of:

a) measuring relative differences in the quantity of a nucleic acid present in each class of cells to 
provide measurements of differential nucleic acid display; 

b) converting the measurements into distances between the classes of cells in a vector space; and

c) preparing a hierarchical classification amongst the classes based on the vector distances.
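
Steps a) to c) of claim 76 can be sketched as follows, purely as an illustration. Each class of cells is represented by a vector of differential measurements, Euclidean distance is used as the vector distance, and the classes are merged bottom-up with single linkage. The class names, the measurement values, and the choice of Euclidean distance with single linkage are assumptions made for this example rather than requirements of the claim.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two measurement vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hierarchical_classification(profiles):
    """Agglomerative single-linkage clustering.

    profiles maps a class name to its measurement vector. Returns the merge
    history as (cluster_a, cluster_b, distance) tuples, closest pair first.
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical differential measurements for three fragments per class.
profiles = {
    "liver": [1.0, 0.0, 2.0],
    "kidney": [1.0, 0.5, 2.0],
    "tumor_A": [5.0, 4.0, 0.0],
    "tumor_B": [5.0, 4.0, 1.0],
}
merges = hierarchical_classification(profiles)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The merge history is the hierarchy: classes joined at small distances are closely related, and the final merge joins the two most dissimilar groups.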

77. The method of claim 76 wherein the classification is performed on classes of cells, wherein 
the cells in a class may be cells of a given cell type, cells from a given tissue, and cells from a given 
organ, cells exhibiting a particular pathological state, or cells which have been contacted with a putative
pharmaceutical agent. 

78. The method of claim 76 wherein the classification is performed on a component of the cells 
in the classes, wherein the component comprises a gene, a nucleic acid, or a fragment thereof.

79. The method of claim 76 wherein the measuring is carried out by a procedure chosen from
the group consisting of differential display of nucleic acid fragments, probing for the presence of a
nucleic acid using an oligonucleotide probe, sequences obtained from expressed sequence tags (ESTs),
assessing restriction fragment length polymorphisms, and assessing amplification fragment length
polymorphisms.

80. The method of claim 76 wherein the preparation of the hierarchical classification is carried 
out by a procedure chosen from the group consisting of principal component analysis of a correlation 
matrix, principal factor analysis of a correlation matrix, principal component analysis of a centered inner
product matrix, and principal factor analysis of a centered inner product matrix.

81. The method of claim 80 further comprising the step of obtaining a distance metric between 
the classes from a reduced dimensionality geometrical representation. 

82. A display means displaying the results of the classification obtained by a method described 
in any one of claims 76-81. 

83. A method for representing a plurality of classes of cells or components thereof geometrically
comprising the steps of:

a) measuring relative differences in the quantity of a nucleic acid present in each class of cells to
provide measurements of differential nucleic acid display; and

b) preparing a geometrical representation amongst the classes based on the measurement of the
differential display.

84. The method of claim 83 wherein the classification is performed on classes of cells, wherein
the cells in a class may be cells of a given cell type, cells from a given tissue, and cells from a given 
organ, cells exhibiting a particular pathological state, or cells which have been contacted with a putative 
pharmaceutical agent. 

85. The method of claim 83 wherein the classification is performed on a component of the cells
in the classes, wherein the component comprises a gene, a nucleic acid, or a fragment thereof.

86. The method of claim 83 wherein the measuring is carried out by a procedure chosen from 
the group consisting of differential display of nucleic acid fragments, probing for the presence of a 
nucleic acid using an oligonucleotide probe, sequences obtained from expressed sequence tags (ESTs),
assessing restriction fragment length polymorphisms, and assessing amplification fragment length
polymorphisms.

87. The method of claim 83 wherein the preparation of the hierarchical classification is carried 
out by a procedure chosen from the group consisting of principal component analysis of a correlation 
matrix, principal factor analysis of a correlation matrix, principal component analysis of a centered inner
product matrix, and principal factor analysis of a centered inner product matrix.

88. The method of claim 87 further comprising the step of obtaining a distance metric between 
the classes from a reduced dimensionality geometrical representation.

89. A display means displaying the results of the geometrical representation obtained by a
method described in any one of claims 83-88. 

90. A method of presenting the hierarchical relatedness of two or more members of a 
population, the method comprising: 

providing a data set of each member in the population; 

generating a hierarchical classification of said data set; and 

displaying said classification, thereby presenting the hierarchical relatedness of the members of
the population. 

91. The method of claim 90, wherein said population is a population of cells.

92. The method of claim 90, wherein said population is a population of nucleic acid sequences.

93. The method of claim 90, wherein said population is a population of polypeptide sequences.

94. The method of claim 90, wherein said hierarchical classification of any two or more 
members of the population is calculated using a distance method in combination with an algorithm. 

95. The method of claim 94, wherein said distance method is a Pearson correlation distance,
Euclidean distance, Manhattan distance, Mahalanobis distance, a pairwise Pearson distance, or a
Spearman distance.

96. The method of claim 95, wherein said algorithm is single linkage, average linkage, or
complete linkage.
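
The interplay between the distance methods of claim 95 and the linkage algorithms of claim 96 can be sketched as follows. The sketch implements three of the listed distances and shows how each linkage rule reduces the pairwise distances between two clusters to one number. Taking the Pearson correlation distance as one minus the correlation coefficient is a common convention adopted here as an assumption; the Mahalanobis and Spearman distances are omitted for brevity, and all data values are hypothetical.

```python
import math
from statistics import mean

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson_distance(u, v):
    """Assumed convention: 1 - r, so 0 means perfectly correlated profiles
    and 2 means perfectly anti-correlated profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return 1.0 - cov / (su * sv)

# Each linkage rule summarizes all member-to-member distances differently.
LINKAGES = {"single": min, "average": mean, "complete": max}

def cluster_distance(cluster_a, cluster_b, metric, linkage):
    """Distance between two clusters of profiles under a metric and linkage."""
    pairwise = [metric(u, v) for u in cluster_a for v in cluster_b]
    return LINKAGES[linkage](pairwise)

# Hypothetical clusters, each holding two three-fragment profiles.
cluster_a = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0]]
cluster_b = [[4.0, 2.0, 3.0], [6.0, 2.0, 3.0]]

results = {
    name: cluster_distance(cluster_a, cluster_b, manhattan, name)
    for name in LINKAGES
}
print(results)
```

Single linkage tends to chain clusters together through their nearest members, while complete linkage favors compact clusters; average linkage sits between the two.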

97. The method of claim 90, wherein said data set is the product of an analysis of said members
of the population that is selected from the group consisting of differential display, serial analysis of gene
expression, expression tagged sequence analysis, restriction fragment length polymorphism, amplified
fragment length polymorphism, or Northern blot hybridization analysis.

98. A method of presenting the geometrical relatedness of two or more members of a 
population, the method comprising: 

providing a data set of each member in the population;

generating a geometrical classification of said data set; and

displaying said classification, thereby presenting the geometrical relatedness of the members of 
the population. 

99. The method of claim 98, wherein said population is a population of cells. 

100. The method of claim 98, wherein said population is a population of nucleic acid sequences.

101. The method of claim 98, wherein said population is a population of polypeptide sequences.

102. The method of claim 98, wherein said geometrical classification is generated by analyzing
a matrix using an algorithm. 

103. The method of claim 102, wherein said matrix includes a correlation matrix.

104. The method of claim 103, wherein said correlation matrix includes a Pearson correlation 
matrix, a Spearman correlation matrix, or a pairwise Pearson correlation matrix.

105. The method of claim 102, wherein said matrix includes a centered inner product distance
matrix.

106. The method of claim 105, wherein the inner product distance matrix is determined using a
distance calculated by hierarchical classification analysis.
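
One conventional way to obtain a centered inner product matrix from a matrix of pairwise distances is double centering of the squared distances, as used in classical multidimensional scaling. The sketch below assumes that this is the construction intended by claims 105 and 106; the claims themselves do not spell out the formula, and the distance values are hypothetical.

```python
def centered_inner_product(dist):
    """Double-center a symmetric distance matrix.

    Computes B[i][j] = -0.5 * (d2[i][j] - rowmean[i] - rowmean[j] + grandmean),
    where d2 holds the squared distances. For distances that come from points
    in Euclidean space, B[i][j] is the inner product of the mean-centered
    coordinates of points i and j.
    """
    n = len(dist)
    d2 = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in d2]  # equals column mean by symmetry
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]

# Hypothetical distances among three classes that lie on a line at
# positions 0, 1 and 5 (mean-centered positions -2, -1 and 3).
distances = [
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]
B = centered_inner_product(distances)
for row in B:
    print([round(value, 6) for value in row])
```

Eigen-decomposition of such a matrix, for example by the principal component analysis of claim 107, then yields coordinates for the classes in a reduced number of dimensions.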

107. The method of claim 102, wherein said algorithm includes principal component analysis. 

108. The method of claim 102, wherein said algorithm includes principal factor analysis.

109. The method of claim 107, wherein said algorithm includes principal factor analysis.

110. The method of claim 102, wherein said geometrical classification is further analyzed using
hierarchical classification. 

111. The method of claim 90, wherein said population includes 5, 10, 25, 50, 100, 1000, 10,000,
100,000 or more members.

112. The method of claim 98, wherein said population includes 5, 10, 25, 50, 100, 1000, 10,000,
100,000 or more members.